(54) Microprocessor with improved instruction set architecture



(57) A data processing system on an integrated circuit 42 with microprocessor 1 and peripheral devices
60-61 is provided with an emulation unit 50 which allows debugging and emulation of integrated circuit 42
when connected to an external test system 51. Microprocessor 1 has an instruction execution pipeline that
has several execution phases that involve fetch/decode units 10a-c and functional execution units 12, 14,
16 and 18. The pipeline of microprocessor 1 is unprotected so that memory access latency to data memory 22
and register file 20 can be utilized by system program code which is stored in instruction memory 23.
Multi-field arithmetic/logic unit (ALU) circuitry (L1, L2, S1, S2) is provided for operating on a set of
source operands to form a multi-field destination operand by treating the plurality of source operands as
a set of N1 fields, such that the multi-field result includes N1 results corresponding to the set of N1
fields. Multi-field multiplication circuitry (M1, M2) is provided for operating on a set of source operands
to form a multi-field destination operand by treating the plurality of source operands as a set of N2
fields, such that the multi-field result includes N2 results corresponding to the set of N2 fields. An
instruction set architecture (ISA) is provided that is optimized for intensive numeric algorithm processing
and includes a set of single instruction, multiple data (SIMD) instructions to direct the operation of the
multi-field ALU circuitry and the multi-field multiplication circuitry. Non-aligned data transfer to data
memory (D1, D2, 22) can be performed for byte, half word, word, and double-word data items.
Description

[0001] This invention relates to data processing devices, electronic processing and control systems and methods of 
their manufacture and operation, and particularly relates to microprocessors optimized for digital signal processing. 
[0002] Generally, a microprocessor is a circuit that combines the instruction-handling, arithmetic, and logical
operations of a computer on a single semiconductor integrated circuit. Microprocessors can be grouped into two
general classes, namely general-purpose microprocessors and special-purpose microprocessors. General-purpose
microprocessors are designed to be programmable by the user to perform any of a wide range of tasks, and are
therefore often used as the central processing unit (CPU) in equipment such as personal computers.
Special-purpose microprocessors, in contrast, are designed to provide performance improvement for specific
predetermined arithmetic and logical functions for which the user intends to use the microprocessor. By knowing
the primary function of the microprocessor, the designer can structure the microprocessor architecture in such
a manner that the performance of the specific function by the special-purpose microprocessor greatly exceeds
the performance of the same function by a general-purpose microprocessor regardless of the program implemented
by the user.

[0003] One such function that can be performed by a special-purpose microprocessor at a greatly improved rate
is digital signal processing. Digital signal processing generally involves the representation, transmission,
and manipulation of signals, using numerical techniques and a type of special-purpose microprocessor known as
a digital signal processor (DSP). Digital signal processing typically requires the manipulation of large
volumes of data, and a digital signal processor is optimized to efficiently perform the intensive computation
and memory access operations associated with this data manipulation. For example, computations for performing
Fast Fourier Transforms (FFTs) and for implementing digital filters consist to a large degree of repetitive
operations such as multiply-and-add and multiple-bit-shift. DSPs can be specifically adapted for these
repetitive functions, and provide a substantial performance improvement over general-purpose microprocessors
in, for example, real-time applications such as image and speech processing.
[0004] DSPs are central to the operation of many of today's electronic products, such as high-speed modems,
high-density disk drives, digital cellular phones, complex automotive systems, and video-conferencing
equipment. DSPs will enable a wide variety of other digital systems in the future, such as video-phones,
network processing, natural speech interfaces, and ultra-high speed modems. The demands placed upon DSPs in
these and other applications continue to grow as consumers seek increased performance from their digital
products, and as the convergence of the communications, computer and consumer industries creates completely
new digital products.
[0005] Microprocessor designers have increasingly endeavored to exploit parallelism to improve performance.
One parallel architecture that has found application in some modern microprocessors utilizes multiple
instruction fetch packets and multiple instruction execution packets with multiple functional units.

[0006] A digital system designed on a single integrated circuit is referred to as an application specific
integrated circuit (ASIC). MegaModules are being used in the design of ASICs to create complex digital systems
on a single chip. (MegaModule is a trademark of Texas Instruments Incorporated.) Types of MegaModules include
SRAMs, FIFOs, register files, RAMs, ROMs, universal asynchronous receiver-transmitters (UARTs), programmable
logic arrays and other such logic circuits. MegaModules are usually defined as integrated circuit modules of
at least 500 gates in complexity and having a complex ASIC macro function. These MegaModules are predesigned
and stored in an ASIC design library. The MegaModules can then be selected by a designer and placed within a
certain area on a new IC chip.

[0007] Designers have succeeded in increasing the performance of DSPs, and microprocessors in general, by
increasing clock speeds, by removing data processing bottlenecks in circuit architecture, by incorporating
multiple execution units on a single processor circuit, and by developing optimizing compilers that schedule
operations to be executed by the processor in an efficient manner. The increasing demands of technology and
the marketplace make desirable even further structural and process improvements in processing devices,
application systems and methods of operation and manufacture.

[0008] The invention provides a microprocessor having an instruction execution pipeline with a plurality of
pipeline phases, comprising:

[0009] program fetch circuitry operable to fetch instructions; instruction decode circuitry connected to
receive fetched instructions from the program fetch circuitry, the instruction decode circuitry operable to
decode a set of single instruction, multiple data (SIMD) instructions; and

at least a first functional unit connected to receive control signals from the instruction decode circuitry,
wherein the first functional unit comprises:

multi-field arithmetic/logic unit (ALU) circuitry connected to receive a plurality of source operands and
having outputs connected to provide a multi-field result as a destination operand in response to control
signals from the instruction decode circuitry, wherein the multi-field ALU circuitry is operable to treat the
plurality of source operands as a set of N1 fields, such that the multi-field result includes N1 results
corresponding to the set of N1 fields; and
wherein the first functional unit is operable to provide the multi-field result in response to a single SIMD
instruction of a first type.

[0010] Preferably, the microprocessor further comprises a second functional unit connected to operate in
parallel with the first functional unit, wherein the second functional unit comprises multi-field
multiplication circuitry connected to receive a plurality of source operands and having outputs connected to
provide a multi-field result as a destination operand in response to control signals from the instruction
decode circuitry, wherein the multi-field multiplication circuitry is operable to treat the plurality of
source operands as a set of N2 fields, such that the multi-field result includes N2 results corresponding to
the set of N2 fields; and

wherein the second functional unit is operable to provide the multi-field result in response to a single SIMD
instruction of a second type.

[0011] Preferably the microprocessor further comprises:

at least a first load/store unit connected to at least a first storage register of the microprocessor;
a memory subsystem having at least a first memory port connected to the first load/store unit;
address generation circuitry in the first load/store unit having a first address output connected to the
first memory port, the address generation circuitry operable to provide a first byte address on the first
address output;
an extraction circuit connected to the first memory port, wherein the extraction circuit is operable to
provide a first non-aligned multi-byte data item to the first load/store unit responsive to the first byte
address; and wherein the load/store unit is operable to provide the multi-byte data item in response to a
single instruction of a third type.

[0012] Preferably the first storage register is part of a register file connected to the first functional
unit and to the second functional unit for providing the plurality of source operands and connected to the
first functional unit and to the second functional unit to receive the destination operands.

[0013] Preferably, the instructions of the first type, second type and third type all have a field for
identifying a predicate register.

[0014] Preferably, the microprocessor is included in a digital system and, preferably, the digital system is
a cellular telephone, further comprising:

an integrated keyboard connected to the CPU via a keyboard adapter;
a display connected to the CPU via a display adapter;

radio frequency (RF) circuitry connected to the CPU; and
an aerial connected to the RF circuitry.

[0015] The invention provides a method of operating a microprocessor, comprising the steps of:

providing multi-field arithmetic/logic unit (ALU) circuitry for operating on a set of source operands to form
a multi-field destination operand by treating the plurality of source operands as a set of N1 fields, such
that the multi-field result includes N1 results corresponding to the set of N1 fields; and

providing a set of single instruction, multiple data (SIMD) instructions to direct the operation of the
multi-field ALU circuitry.

[0016] Further, a method of operating a microprocessor comprises the steps of:

providing multi-field arithmetic/logic unit (ALU) circuitry for operating on a set of source operands to form
a multi-field destination operand by treating the plurality of source operands as a set of N1 fields, such
that the multi-field result includes N1 results corresponding to the set of N1 fields;

providing multi-field multiplication circuitry for operating on a set of source operands to form a multi-field
destination operand by treating the plurality of source operands as a set of N2 fields, such that the
multi-field result includes N2 results corresponding to the set of N2 fields; and
providing a set of single instruction, multiple data (SIMD) instructions to direct the operation of the
multi-field ALU circuitry and the multi-field multiplication circuitry.

[0017] Preferably, the method further comprises the step of providing load/store circuitry operable to
transfer a multi-byte data item to a memory circuit at a byte address that is not word aligned.
[0018] An illustrative embodiment of the present invention seeks to provide a microprocessor, and a method
for operating a microprocessor, that improves digital signal processing performance. Aspects of the invention
are specified in the claims.

[0019] In an embodiment of the present invention, a digital signal processor is provided which has
improvements over prior art devices including: register file enhancements, data path extensions, additional
functional unit hardware, increased orthogonality of the instruction set, data flow enhancements/SIMD
operation, and additional instructions that reduce code size and increase register flexibility.

[0020] In an embodiment of the present invention, data flow enhancements include: an extensive collection of
PACK and byte shift instructions which simplifies manipulation of packed data types; Single Instruction,
Multiple Data (SIMD) instructions that operate directly on packed data to streamline data flow and increase
instruction set efficiency, including:

Quad 8x8 multiply operations
Paired 16x16 multiply operations
Quad 8x8 multiply with add operations 
Paired 16x16 multiply with add/subtract operations 
Quad 8-bit arithmetic operations 
Paired 16-bit arithmetic operations 
Quad 8-bit min/max operations 
Paired 16-bit min/max operations 
Quad 8-bit compare operations 
Paired 16-bit compare operations 
Paired 16-bit shift operations 
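
As an illustration of what "paired" and "quad" operations on packed data mean, the following is a minimal Python sketch, not the hardware or the actual instruction definitions. The function names, the choice of signed 16-bit fields for the multiply with add, and the choice of unsigned 8-bit fields for the max operation are assumptions made for this example only.

```python
def split_halves(word):
    """Split a 32-bit word into two signed 16-bit fields (high, low)."""
    def to_signed16(x):
        return x - 0x10000 if x & 0x8000 else x
    return to_signed16((word >> 16) & 0xFFFF), to_signed16(word & 0xFFFF)


def paired_mul_add(a, b):
    """Paired 16x16 multiply with add: hi(a)*hi(b) + lo(a)*lo(b).

    Signed fields are an assumption of this sketch.
    """
    a_hi, a_lo = split_halves(a)
    b_hi, b_lo = split_halves(b)
    return a_hi * b_hi + a_lo * b_lo


def quad_max_u8(a, b):
    """Quad 8-bit max: lane-wise maximum of four packed unsigned bytes."""
    result = 0
    for shift in (0, 8, 16, 24):
        lane = max((a >> shift) & 0xFF, (b >> shift) & 0xFF)
        result |= lane << shift
    return result
```

Each call stands in for one SIMD instruction: a single operation consumes two 32-bit source operands and produces every per-field result at once.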

[0021] Other features and advantages of the present invention will become apparent by reference to the following 
detailed description when considered in conjunction with the accompanying drawings, in which: 

Figure 1 is a block diagram of a digital signal processor (DSP), showing components thereof pertinent to an
embodiment of the present invention;
Figures 2A-2E are block diagrams of the functional units, data paths and register files of Figure 1;
Figure 3A shows the addressing mode register (AMR) of the DSP of Figure 1;
Figure 3B shows the control status register (CSR) which contains control and status bits of the DSP of
Figure 1;
Figure 3C shows an E1 Phase Program Counter (PCE1) register of the DSP;
Figure 4 shows a Galois Field Polynomial Generator Function Register;
Figure 5 depicts a general-purpose input register (IN) which supports 32 general-purpose input signals of the
DSP of Figure 1;
Figure 6 depicts a general-purpose output register (OUT) which supports 32 general-purpose output signals of
the DSP of Figure 1;
Figure 7 illustrates the register storage scheme for 40-bit data of the DSP of Figure 1;
Figures 8A-8J show an opcode map for the DSP of Figure 1;
Figure 9A shows the basic format of a fetch packet of the DSP of Figure 1;
Figure 9B depicts a fetch packet of Figure 9A with fully serial p-bits;
Figure 9C depicts a fetch packet of Figure 9A with fully parallel p-bits;
Figure 9D depicts a fetch packet of Figure 9A with partially serial p-bits;
Figures 10A and 10B illustrate aspects of non-aligned address formation and non-aligned data extraction from a
circular buffer region;
Figure 11 shows the phases of the pipeline of the DSP of Figure 1;
Figure 12 shows the branch instruction phases;
Figure 13 shows the operation of the pipeline of the DSP of Figure 1 based on clock cycles and fetch packets;
Figure 14 depicts fetch packet n, which contains three execute packets, shown followed by six fetch packets
(n+1 through n+6), each with one execution packet (containing 8 parallel instructions);
Figure 15 is a block diagram of an MTAP to Test Port Interface for the processor of Figure 1;
Figure 16A is an illustration of fetch packets and execution packets in a prior art processor;
Figure 16B is an illustration of execution packets spanning fetch packets for the processor of Figure 1; and
Figure 17 is a block diagram of an alternative embodiment of the processor of Figure 1.
[0022] Figure 1 is a block diagram of a microprocessor 1 which has an embodiment of the present invention.
Microprocessor 1 is a VLIW digital signal processor ("DSP"). In the interest of clarity, Figure 1 only shows
those portions of microprocessor 1 that are relevant to an understanding of an embodiment of the present
invention. Details of general construction for DSPs are well known, and may be found readily elsewhere. For
example, U.S. Patent 5,072,418 issued to Frederick Boutaud, et al. describes a DSP in detail and is
incorporated herein by reference. U.S. Patent 5,329,471 issued to Gary Swoboda, et al. describes in detail how
to test and emulate a DSP and is incorporated herein by reference. Details of portions of microprocessor 1
relevant to an embodiment of the present invention are explained in sufficient detail hereinbelow, so as to
enable one of ordinary skill in the microprocessor art to make and use the invention.

[0023] In microprocessor 1 there are shown a central processing unit (CPU) 10, data memory 22, program memory
23, peripherals 60 and an external memory interface (EMIF) with a direct memory access (DMA) 61. CPU 10
further has an instruction fetch/decode unit 10a-c, a plurality of execution units, including an arithmetic
and load/store unit D1, a multiplier M1, an ALU/shifter unit S1, an arithmetic logic unit ("ALU") L1, and a
shared multiport register file 20a from which data are read and to which data are written. Instructions are
fetched by fetch unit 10a from instruction memory 23 over a set of busses 41. Decoded instructions are
provided from the instruction fetch/decode unit 10a-c to the functional units D1, M1, S1, and L1 over various
sets of control lines which are not shown. Data are provided to/from the register file 20a from/to load/store
unit D1 over a first set of busses 32a, to multiplier M1 over a second set of busses 34a, to ALU/shifter unit
S1 over a third set of busses 36a and to ALU L1 over a fourth set of busses 38a. Data are provided to/from the
memory 22 from/to the load/store unit D1 via a fifth set of busses 40a. Note that the entire data path
described above is duplicated with register file 20b and execution units D2, M2, S2, and L2. In this
embodiment of the present invention, two unrelated aligned double word (64 bits) load/store transfers can be
made in parallel between CPU 10 and data memory 22 on each clock cycle using bus set 40a and bus set 40b.

[0024] A single non-aligned double word load/store transfer is performed by scheduling a first .D unit
resource and two load/store ports on memory 22. Advantageously, an extraction circuit is connected to the
memory subsystem to provide a non-aligned data item extracted from two aligned data items requested by the .D
unit. Advantageously, a second .D unit can perform 32-bit logical or arithmetic instructions in addition to
the .S and .L units while the address port of the second .D unit is being used to transmit one of two
contiguous addresses provided by the first .D unit. Furthermore, a non-aligned access near the end of a
circular buffer region in the target memory provides a non-aligned data item that wraps around to the other
end of the circular buffer.
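
The extraction of a non-aligned item from two aligned items can be sketched in software as follows. This is an illustrative Python model only: the function name `load_nonaligned`, the little-endian byte order, the `region` parameter, and the requirement that a circular region start and length be multiples of the access size are all assumptions of this sketch rather than details taken from the description.

```python
def load_nonaligned(memory, byte_addr, size=8, region=None):
    """Return a `size`-byte item starting at an arbitrary byte address.

    memory : bytes-like object modelling the data memory
    region : optional (base, length) of a circular buffer; both values are
             assumed to be multiples of `size` in this sketch
    """
    offset = byte_addr % size
    first = byte_addr - offset      # aligned address at or below byte_addr
    second = first + size           # next contiguous aligned address
    if region is not None:
        base, length = region
        # Wrap the second aligned address back into the circular region.
        second = base + (second - base) % length
    # Two aligned reads, one per memory port, form a 2*size-byte window.
    window = memory[first:first + size] + memory[second:second + size]
    # The extraction step selects `size` bytes starting at the byte offset.
    item = window[offset:offset + size]
    return int.from_bytes(item, "little")   # little-endian is an assumption
```

For example, with `memory = bytes(range(32))`, a 4-byte access at address 14 inside the region `(0, 16)` returns bytes 14, 15, 0 and 1, so the item wraps to the start of the buffer.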

[0025] Emulation circuitry 50 provides access to the internal operation of integrated circuit 1 that can be
controlled by an external test/development system (XDS) 51. External test system 51 is representative of a
variety of known test systems for debugging and emulating integrated circuits. One such system is described in
U.S. Patent 5,535,331, which is incorporated herein by reference. Test circuitry 52 contains control registers
and parallel signature analysis circuitry for testing integrated circuit 1.

[0026] Note that the memory 22 and memory 23 are shown in Figure 1 to be a part of a microprocessor 1
integrated circuit, the extent of which is represented by the box 42. The memories 22-23 could just as well be
external to the microprocessor 1 integrated circuit 42, or part of it could reside on the integrated circuit
42 and part of it be external to the integrated circuit 42. These are matters of design choice. Also, the
particular selection and number of execution units are a matter of design choice, and are not critical to the
invention.

[0027] When microprocessor 1 is incorporated in a data processing system, additional memory or peripherals
may be connected to microprocessor 1, as illustrated in Figure 1. For example, Random Access Memory (RAM) 70,
a Read Only Memory (ROM) 71 and a Disk 72 are shown connected via an external bus 73. Bus 73 is connected to
the External Memory Interface (EMIF) which is part of functional block 61 within microprocessor 1. A Direct
Memory Access (DMA) controller is also included within block 61. The DMA controller is generally used to move
data between memory and peripherals within microprocessor 1 and memory and peripherals which are external to
microprocessor 1.
[0028] In the present embodiment, CPU core 10 is encapsulated as a MegaModule; however, other embodiments of
the present invention may be in custom designed CPUs or mass market microprocessors, for example.

[0029] A detailed description of various architectural features of the microprocessor of Figure 1 is provided
in coassigned application Serial No. 09/012,813 (TI-25311) and is incorporated herein by reference. A
description of a portion of the instruction set not described herein for the microprocessor of Figure 1 is
also provided in coassigned application Serial No. 09/012,813 (TI-25311) and is incorporated herein by
reference.

[0030] Figure 2A is a block diagram of the execution units and register files of the microprocessor of
Figure 1 and shows a more detailed view of the buses connecting the various functional blocks. In this figure,
all data busses are 32 bits wide, unless otherwise noted. There are two general-purpose register files (A and
B) in the processor's data paths. Each of these files contains 32 32-bit registers (A0-A31 for file A and
B0-B31 for file B). The general-purpose registers can be used for data, data address pointers, or condition
registers. Any number of reads of a given register can be performed in a given cycle.

[0031] The general-purpose register files support data ranging in size from packed 8-bit data through 64-bit
fixed-point data. Values larger than 32 bits, such as 40-bit long and 64-bit double word quantities, are
stored in register pairs, with the 32 LSBs of data placed in an even-numbered register and the remaining 8 or
32 MSBs in the next upper register (which is always an odd-numbered register). Packed data types store either
four 8-bit values or two 16-bit values in a single 32-bit register.

[0032] There are 32 valid register pairs for 40-bit and 64-bit data, as shown in Table 1. In assembly language
syntax, a colon between the register names denotes the register pairs and the odd numbered register is [...]
in the instruction opcode.
Table 1. 40-Bit/64-Bit Register Pairs

Register Files
A            B
A1:A0        B1:B0
A3:A2        B3:B2
A5:A4        B5:B4
A7:A6        B7:B6
A9:A8        B9:B8
A11:A10      B11:B10
A13:A12      B13:B12
A15:A14      B15:B14
A17:A16      B17:B16
A19:A18      B19:B18
A21:A20      B21:B20
A23:A22      B23:B22
A25:A24      B25:B24
A27:A26      B27:B26
A29:A28      B29:B28
A31:A30      B31:B30

[0033] [...] Operations requiring a long input ignore the 24 MSBs of the odd register. Operations producing a
long result zero-fill the 24 MSBs of the odd register. The even register is encoded in the opcode.
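
The register-pair storage scheme for 40-bit and 64-bit values can be modelled in a few lines. The following Python sketch is illustrative only; the helper names and the use of a plain list as a register file are assumptions of this example, not part of the described hardware.

```python
MASK32 = 0xFFFFFFFF


def store_long(regs, even, value):
    """Write a 40-bit value to the pair (even+1):even.

    The 32 LSBs go to the even register; the 8 MSBs go to the odd register,
    whose upper 24 bits are zero-filled.
    """
    if even % 2:
        raise ValueError("a register pair starts at an even-numbered register")
    regs[even] = value & MASK32
    regs[even + 1] = (value >> 32) & 0xFF


def load_long(regs, even):
    """Read a 40-bit value, ignoring the 24 MSBs of the odd register."""
    return ((regs[even + 1] & 0xFF) << 32) | (regs[even] & MASK32)


def store_double(regs, even, value):
    """Write a 64-bit double word: 32 LSBs to even, 32 MSBs to odd."""
    if even % 2:
        raise ValueError("a register pair starts at an even-numbered register")
    regs[even] = value & MASK32
    regs[even + 1] = (value >> 32) & MASK32
```

With `regs = [0] * 32`, calling `store_long(regs, 4, 0x123456789A)` models a write to the pair written 5:4 in the colon notation, leaving `0x3456789A` in register 4 and `0x12` in register 5.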

[0034] The eight functional units in processor 10's data paths are divided into two groups of four; each
functional unit in one data path is almost identical to the corresponding unit in the other data path. The
functional units are described in Table 2.

[0035] Besides being able to perform 32-bit data manipulations, processor 10 also contains many 8-bit and
16-bit data instructions in the instruction set. For example, the MPYU4 instruction performs four 8x8 unsigned
multiplies with a single instruction on a .M unit. The ADD4 instruction performs four 8-bit additions with a
single instruction on a .L unit.
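
The behaviour of the two instructions named above can be approximated in software. This Python sketch is a hedged model, not the instruction definition: the lane ordering of the returned products and the wrap-around (non-saturating) behaviour of each 8-bit sum are assumptions of this example.

```python
def add4(a, b):
    """Model of ADD4: four independent 8-bit additions on packed 32-bit words.

    Carries do not cross lane boundaries; each lane wraps modulo 256
    (wrap-around rather than saturation is an assumption of this sketch).
    """
    result = 0
    for shift in (0, 8, 16, 24):
        lane = (((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)) & 0xFF
        result |= lane << shift
    return result


def mpyu4(a, b):
    """Model of MPYU4: four 8x8 unsigned multiplies.

    Returns the four 16-bit products as a list, least significant lane first.
    """
    return [((a >> s) & 0xFF) * ((b >> s) & 0xFF) for s in (0, 8, 16, 24)]
```

Four 16-bit products occupy 64 bits in total, so the result does not fit in a single 32-bit register.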



Table 2. Functional Units and Operations Performed

Functional Unit        Fixed-Point Operations

.L unit (.L1, .L2)     32/40-bit arithmetic and compare operations
                       32-bit logical operations
                       Leftmost 1 or 0 counting for 32 bits
                       Normalization count for 32 and 40 bits
                       Byte shifts
                       Data packing/unpacking
                       5-bit constant generation
                       Paired 16-bit arithmetic operations
                       Quad 8-bit arithmetic operations
                       Paired 16-bit min/max operations
                       Quad 8-bit min/max operations

.S unit (.S1, .S2)     32-bit arithmetic operations
                       32/40-bit shifts and 32-bit bit-field operations
                       32-bit logical operations
                       Branches
                       Constant generation
                       Register transfers to/from control register file (.S2 only)
                       Byte shifts
                       Data packing/unpacking
                       Paired 16-bit compare operations
                       Quad 8-bit compare operations
                       Paired 16-bit shift operations
                       Paired 16-bit saturated arithmetic operations
                       Quad 8-bit saturated arithmetic operations

.M unit (.M1, .M2)     16 x 16 multiply operations
                       16 x 32 multiply operations
                       Bit expansion
                       Bit interleaving/de-interleaving
                       Quad 8 x 8 multiply operations
                       Paired 16 x 16 multiply operations
                       Paired 16 x 16 multiply with add/subtract operations
                       Quad 8 x 8 multiply with add operations
                       Variable shift operations
                       Rotation
                       Galois Field Multiply

.D unit (.D1, .D2)     32-bit add, subtract, linear and circular address calculation
                       Loads and stores with 5-bit constant offset
                       Loads and stores with 15-bit constant offset (.D2 only)
                       Load and store double words with 5-bit constant
                       Load and store non-aligned words and double words
                       5-bit constant generation
                       32-bit logical operations

[0036] Most data lines in the CPU support 32-bit operands, and some support long (40-bit) and double word
(64-bit) operands. Each functional unit has its own 32-bit write port into a general-purpose register file
(refer to Figure 2A). All units ending in 1 (for example, .L1) write to register file A 20a and all units
ending in 2 write to register file B 20b. Each functional unit has two 32-bit read ports for source operands
src1 and src2. Four units (.L1, .L2, .S1 and .S2) have an extra 8-bit-wide port for 40-bit long writes, as
well as an 8-bit input for 40-bit long reads. Because each unit has its own 32-bit write port, when performing
32-bit operations all eight units can be used in parallel every cycle. Since each multiplier can return up to
a 64-bit result, two write ports are provided from the multipliers to the register file.

Register File Cross Paths 

[0037] Each functional unit reads directly from and writes directly to the register file within its own data
path. That is, the .L1, .S1, .D1, and .M1 units write to register file A and the .L2, .S2, .D2, and .M2 units
write to register file B. The register files are connected to the opposite-side register file's functional
units via the 1X and 2X cross paths. These cross paths allow functional units from one data path to access a
32-bit operand from the opposite side's register file. The 1X cross path allows data path A's functional units
to read their source from register file B. Similarly, the 2X cross path allows data path B's functional units
to read their source from register file A.

[0038] All eight of the functional units have access to the opposite side's register file via a cross path.
The .M1, .M2, .S1, .S2, .D1 and .D2 units' src2 inputs are selectable between the cross path and the same-side
register file. In the case of the .L1 and .L2 both src1 and src2 inputs are also selectable between the cross
path and the same-side register file.

[0039] Only two cross paths, 1X and 2X, exist in this embodiment of the architecture. Thus the limit is one
source read from each data path's opposite register file per cycle, or a total of two cross-path source reads
per cycle. Advantageously, multiple units on a side may read the same cross-path source simultaneously. Thus
the cross path operand for one side may be used by any one, multiple or all the functional units on that side
in an execute packet. In the C62x/C67x, only one functional unit per data path, per execute packet, could get
an operand from the opposite register file.

[0040] A delay clock cycle is introduced whenever an instruction attempts to read a register via a cross path
that was updated in the previous cycle. This is known as a cross path stall. This stall is inserted
automatically by the hardware; no NOP instruction is needed. It should be noted that no stall is introduced if
the register being read is the destination for data loaded by a LDx instruction.
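
The cross-path rules above lend themselves to a small checker. The following Python sketch is a simplified scheduling model under stated assumptions: the `Instr` tuple, the representation of an execute packet as a list, and both function names are hypothetical, and the stall model reduces the hardware behaviour to a single yes/no per execute packet.

```python
from collections import namedtuple

# unit: e.g. ".L1" or ".M2"; cross_src: register read over the cross path,
# or None if the instruction does not use the cross path.
Instr = namedtuple("Instr", "unit cross_src")


def cross_path_ok(packet):
    """True if each side reads at most one distinct cross-path register.

    Several units on one side may share the same cross-path source.
    """
    sources = {"1": set(), "2": set()}
    for ins in packet:
        if ins.cross_src is not None:
            sources[ins.unit[-1]].add(ins.cross_src)
    return all(len(regs) <= 1 for regs in sources.values())


def cross_path_stalls(packet, written_last_cycle, loaded_last_cycle=frozenset()):
    """Return 1 if a cross-path source was updated in the previous cycle, else 0.

    Registers that were the destination of loaded data do not cause a stall.
    """
    reads = {ins.cross_src for ins in packet if ins.cross_src is not None}
    hazards = set(written_last_cycle) - set(loaded_last_cycle)
    return int(bool(reads & hazards))
```

For instance, a packet in which `.L1` and `.S1` both read `B4` over the 1X path while `.M2` reads `A2` over the 2X path passes `cross_path_ok`, whereas `.L1` reading `B4` and `.S1` reading `B5` in the same packet does not.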

Memory, Load and Store Paths

[0041] Processor 10 supports double word loads and stores. There are four 32-bit paths for loading data from
memory to the register file. For side A, LD1a is the load path for the 32 LSBs; LD1b is the load path for the
32 MSBs. For side B, LD2a is the load path for the 32 LSBs; LD2b is the load path for the 32 MSBs. There are
also four 32-bit paths for storing register values to memory from each register file. ST1a is the write path
for the 32 LSBs on side A; ST1b is the write path for the 32 MSBs for side A. For side B, ST2a is the write
path for the 32 LSBs; ST2b is the write path for the 32 MSBs.

[0042] Some of the ports for long and double word operands are shared between functional units. This places a 
constraint on which long or double word operations can be scheduled on a datapath in the same execute packet. 

Data Address Paths 



[0043] Bus 40a has an address bus DA1 which is driven by mux 200a. This allows an address generated by either
load/store unit D1 or D2 to provide a memory address for loads or stores for register file 20a. Data bus LD1
loads data from an address in memory 22 specified by address bus DA1 to a register in load unit D1. Unit D1
may manipulate the data provided prior to storing it in register file 20a. Likewise, data bus ST1 stores data
from register file 20a to memory 22.



[0044] The DA1 and DA2 resources and their associated data paths are specified as T1 and T2 respectively. T1
consists of the DA1 address path and the LD1a, LD1b, ST1a and ST1b data paths. Similarly, T2 consists of the
DA2 address path and the LD2a, LD2b, ST2a and ST2b data paths. The T1 and T2 designations appear in functional
unit fields for load and store instructions. For example, a load instruction can use the .D1 unit to generate
the address while using the LD2a path resource from DA2; the use of the DA2 resource is indicated with the T2
designation.

[...]
multiply in parallel with M multiply unit 710. The lower 32 bits (bits 31:0) of a result are selected by
multiplexer 734 and are stored in the even register of a register pair. The upper 32 bits (bits 63:32) of the
result are selected by multiplexer 735 and are stored in the odd register of the register pair. A more
detailed description of configurable multiply circuitry is provided in co-assigned U.S. Patent application
(TI-26010) entitled Data Processor With Flexible Multiply Unit and is incorporated herein by reference.
Details of the Galois multiply unit are provided in co-assigned U.S. Patent application (TI-26013) to David
Hoyle entitled Galois Field Multiply and is incorporated herein by reference.

[0056] Figure 2e is a top level block diagram of D group unit 12a,b, which executes the load/store instructions and
performs address calculations. D unit group 12a,b interfaces with Data Memory Controller 22. Load and Store
instructions operate on data sizes from 8 bits to 64 bits. The different addressing modes supported by the D unit group are
basic addressing, offset addressing, indexed addressing, auto-increment/auto-decrement, long immediate addressing
and circular addressing. In basic addressing mode, the content of a register is used as a memory address. In offset
addressing mode, the memory address is determined by two values, a base value and an offset that is either added
to or subtracted from the base. The base value always comes from an address register, whereas the offset value may
come from either an address register or a 5-bit unsigned constant contained in the instruction. Index addressing mode
functions the same as offset addressing mode, except that the offset is interpreted as an index into a table of bytes,
halfwords, words or double-words, as indicated by the data size of the load or store operation. In auto-increment/
decrement addressing mode, the base register is incremented/decremented after the execution of the load/store
instruction. There are two sub-modes: pre-increment/decrement, where the new value in the base register is used as
the load/store address, and post-increment/decrement, where the original value in the register is used as the load/store
address. In long-immediate addressing mode, a 14-bit unsigned constant is added to a base register to determine the
memory address. In circular addressing mode, the base register along with a block size define a region in memory. To
access a memory location in that region, a new index value is generated from the original index modulo the block

^'^^o^! tT^^'^ ^'^""'^ ^-^^^ '^^'^ ^"^ "'"'^ ^° '^^ '^^"t^o' register file 102 using buses 220 

and 221. Table 3 lists the control registers contained in the control register file, and briefly describes each. The control
registers are described more fully later herein. Each control register is accessed by the MVC instruction; see the MVC
instruction description later herein.

[0057] Additionally, some of the control register bits are specially accessed in other ways. For example, arrival of a
maskable interrupt on an external interrupt pin, INTm, triggers the setting of flag bit IFRm. Subsequently, when that
interrupt is processed, this triggers the clearing of IFRm and the clearing of the global interrupt enable bit, GIE. Finally,
when that interrupt processing is complete, the B IRP instruction in the interrupt service routine restores the GIE's
pre-interrupt value. Similarly, saturating instructions like SADD set the SAT (saturation) bit in the CSR (Control Status
Register).



Table 3. Control Registers

Abbreviation  Name                                  Description
AMR           Addressing mode register              Specifies whether to use linear or circular addressing for one of eight registers; also contains sizes for circular addressing
CSR           Control status register               Contains the global interrupt enable bit, cache control bits, and other miscellaneous control and status bits
IFR           Interrupt flag register               Displays status of interrupts
ISR           Interrupt set register                Allows you to set pending interrupts manually
ICR           Interrupt clear register              Allows you to clear pending interrupts manually
IER           Interrupt enable register             Allows enabling/disabling of individual interrupts
ISTP          Interrupt service table pointer       Points to the beginning of the interrupt service table
IRP           Interrupt return pointer              Contains the address to be used to return from a maskable interrupt
NRP           Nonmaskable interrupt return pointer  Contains the address to be used to return from a nonmaskable interrupt
IN            General-purpose input register        Contains 32 input signals
OUT           General-purpose output register       Contains 32 output signals
PCE1          Program counter                       Contains the address of the fetch packet that contains the execute packet in the E1 pipeline stage
PDATA_O       Program data out                      Contains 32 output signals; used by the STP instruction to write to program space



Pipelining/Timing of Control Register Accesses 

[0058] As shown in Table 4, all MVC instructions are single-cycle instructions, completing their access of the explicitly
named registers in the E1 pipeline phase. This is true whether MVC is moving a general register to a control register or vice
versa. In all cases, the source register content is read, moved through the .S2 unit, and written to the destination
register in the E1 pipeline phase.



Table 4. Control Register Pipelining

Pipeline Stage  E1
Read            src2
Written         dst
Unit in use     .S2



[0059] Even though MVC modifies the particular target control register in a single cycle, it can take extra clocks to
complete modification of a register that is not explicitly named. For example, MVC cannot modify bits in the IFR directly.
Instead, MVC can only write 1s into the ISR or the ICR to specify setting or clearing, respectively, of the IFR bits. MVC
completes this ISR/ICR write in a single (E1) cycle (as described above), but the modification of the IFR bits themselves
occurs one clock later.
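
The indirect IFR update described above can be sketched as a small behavioral model. The class and method names are hypothetical and exist only for illustration; the only behavior taken from the text is that writing 1s to the ISR sets IFR bits, writing 1s to the ICR clears them, and the change becomes visible one clock after the write. The priority applied when the same bit is set and cleared in one cycle is an assumption.

```python
class InterruptFlags:
    """Toy model of IFR updates made through ISR/ICR writes (illustrative names)."""

    def __init__(self):
        self.ifr = 0             # interrupt flag register; not written directly
        self._pending_set = 0    # 1s written to ISR during the current cycle
        self._pending_clear = 0  # 1s written to ICR during the current cycle

    def mvc_write_isr(self, mask):
        self._pending_set |= mask & 0xFFFFFFFF

    def mvc_write_icr(self, mask):
        self._pending_clear |= mask & 0xFFFFFFFF

    def clock(self):
        """Advance one clock and apply the writes made in the previous cycle."""
        # Assumption: a clear wins over a set of the same bit in the same cycle.
        self.ifr = (self.ifr | self._pending_set) & ~self._pending_clear & 0xFFFFFFFF
        self._pending_set = 0
        self._pending_clear = 0
```

In this model a flag written through `mvc_write_isr` is not observable in `ifr` until `clock()` has been called once, which mirrors the one-clock delay.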

[0060] Saturating instructions, such as SADD, set the SAT bit in the Control Status Register (CSR) indirectly. As a
result, several of these instructions update the SAT bit one full clock cycle after their primary results are written to the
register file. For example, SMPY writes its result at the end of pipeline stage E2; its primary result is available after
one delay slot. In contrast, the SAT bit in the CSR is updated one cycle later than the result is written; this update
occurs after two delay slots. (For specific instruction behavior, refer to the instruction's individual description.)

[0061] The B IRP and B NRP instructions directly update the GIE and NMIE, respectively. Because these branches
directly modify the CSR and IER (Interrupt Enable Register), respectively, there are no delay slots between when the
branch is issued and when the control register updates take effect.

[0062] Figure 3A shows the addressing mode register (AMR). Eight registers (A4-A7, B4-B7) can perform circular
addressing. For each of these registers, the AMR specifies the addressing mode. A 2-bit field for each register is used 
to select the address modification mode: linear (the default) or circular mode. With circular addressing, the field also 
specifies which BK (block size) field to use for a circular buffer. In addition, the buffer must be aligned on a byte boundary 
equal to the block size. The mode select field encoding is shown in Table 5. 



Table 5. Addressing Mode Field Encoding

Mode  Description
00    Linear modification (default at reset)
01    Circular addressing using the BK0 field
10    Circular addressing using the BK1 field
11    Reserved



[0063] The block size fields, BK0 and BK1, specify block sizes for circular addressing. The five bits in BK0 and BK1
specify the width. The formula for calculating the block size width is:

Block size (in bytes) = 2^(N+1)

where N is the value in BK1 or BK0.

[0064] Table 6 shows block size calculations for all 32 possibilities.

Table 6. Block Size Calculations

N      Block Size    N      Block Size
00000  2             10000  131,072
00001  4             10001  262,144
00010  8             10010  524,288
00011  16            10011  1,048,576
00100  32            10100  2,097,152
00101  64            10101  4,194,304
00110  128           10110  8,388,608
00111  256           10111  16,777,216
01000  512           11000  33,554,432
01001  1,024         11001  67,108,864
01010  2,048         11010  134,217,728
01011  4,096         11011  268,435,456
01100  8,192         11100  536,870,912
01101  16,384        11101  1,073,741,824
01110  32,768        11110  2,147,483,648
01111  65,536        11111  4,294,967,296

Note: when N is 11111, the behavior is identical to linear addressing.
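
The block size calculation, and the way an address wraps inside a circular buffer, can be illustrated with a short sketch. The function names are hypothetical. The wrap rule (keep the upper address bits and wrap only the low bits selected by the block size) is an assumption that is consistent with the requirement that the buffer be aligned on a boundary equal to the block size; it is not quoted from the text.

```python
MASK32 = 0xFFFFFFFF


def block_size(n):
    """Block size in bytes for a 5-bit BK0/BK1 value n: 2**(n + 1)."""
    if not 0 <= n <= 31:
        raise ValueError("BK0/BK1 are 5-bit fields")
    return 1 << (n + 1)


def circular_add(base, offset, n):
    """Add a (possibly negative) byte offset to base, wrapping within the
    aligned block that contains base."""
    low_mask = block_size(n) - 1
    base &= MASK32
    upper = base & ~low_mask & MASK32      # bits above the block stay fixed
    lower = (base + offset) & low_mask     # bits inside the block wrap around
    return upper | lower
```

With N = 11111 the low mask covers all 32 address bits, so `circular_add` reduces to ordinary 32-bit addition, which agrees with the note under Table 6.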



[0065] The control status register (CSR), shown in Figure 3B, contains control and status bits. The functions of the
bit fields in the CSR are shown in Table 7.

Table 7. Control Status Register: Bit Fields, Read/Write Status and Function

Bit Position  Width  Bit Field Name  Function
31-24         8      CPU ID          CPU ID. Defines which CPU.
23-16         8      Rev ID          Revision ID. Defines silicon revision of the CPU.
15-10         6      PWRD            Control power down modes. The values will always be read as zero.
9             1      SAT             The saturate bit, set when any unit performs a saturate, can be cleared
                                     only by the MVC instruction and can be set only by a functional unit. The
                                     set by a functional unit has priority over a clear (by the MVC instruction)
                                     if they occur on the same cycle. The saturate bit gets set one full cycle (1
                                     delay slot) after a saturate occurs. This bit will not be modified by a
                                     conditional instruction whose condition is false.
8             1      EN              Endian bit: 1 = little endian, 0 = big endian.
7-5           3      PCC             Program cache control mode
4-2           3      DCC             Data cache control mode
1             1      PGIE            Previous GIE (global interrupt enable). Saves GIE when an interrupt is taken.
0             1      GIE             Global interrupt enable. Enables (1) or disables (0) all interrupts except
                                     the reset interrupt and NMI (nonmaskable interrupt).
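
As an illustration of the CSR layout in Table 7, the following sketch splits a 32-bit CSR value into its named fields. The dictionary and function names are hypothetical; the bit positions and widths are the ones listed in the table.

```python
# (least significant bit, width) for each CSR field, per Table 7
CSR_FIELDS = {
    "CPU_ID": (24, 8),
    "REV_ID": (16, 8),
    "PWRD": (10, 6),
    "SAT": (9, 1),
    "EN": (8, 1),
    "PCC": (5, 3),
    "DCC": (2, 3),
    "PGIE": (1, 1),
    "GIE": (0, 1),
}


def decode_csr(value):
    """Return a dict mapping each CSR field name to its value."""
    return {
        name: (value >> lsb) & ((1 << width) - 1)
        for name, (lsb, width) in CSR_FIELDS.items()
    }
```

For example, `decode_csr(0x00000301)` reports SAT = 1, EN = 1 (little endian) and GIE = 1, with every other field zero.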



[0066] Figure 3C shows an E1 Phase Program Counter (PCE1) register that contains a 32-bit address of a fetch
packet in the E1 pipeline phase.

[0067] Figure 4 shows a Galois Field Polynomial Generator Function Register (GFPGFR). The GFPGFR contains
a Galois field polynomial generator and field size control bits. These bits control the operation of the GMPY4 instruction.
This register is only settable via the MVC instruction.

[0068] Table 8 defines the fields of the GFPGFR. The default function after reset for the GMPY4 instruction is field
size = 7 and polynomial = 0x1D.



Table 8. GFPGFR Fields

Bit Position  Width  Field Name  Function
31-27         5      RESERVED    Reserved. Read as zero. Ignored on write.
26-24         3      SIZE        Field size.
23-8          16     RESERVED    Reserved. Read as zero. Ignored on write.
7-0           8      POLY        Polynomial Generator.



[0069] If a GMPY4 instruction is contained in the next execute packet after an MVC instruction that changes the
GFPGFR value, the GMPY4 will be controlled by the newly loaded GFPGFR value.
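
The kind of operation that the GFPGFR controls can be sketched as a bitwise Galois field multiply applied to four packed bytes. This is only an illustration under stated assumptions: it assumes that SIZE holds the field degree minus one (so the reset value 7 selects an 8-bit field) and that POLY holds the generator polynomial without its leading term (so 0x1D stands for 0x11D). Neither encoding is spelled out in this passage, and the function names are hypothetical.

```python
def gf_mul(a, b, size=7, poly=0x1D):
    """Multiply a and b in GF(2**(size + 1)) by shift-and-XOR, then reduce."""
    m = size + 1                       # assumed field degree
    full_poly = (1 << m) | poly        # assumed implicit leading term
    a &= (1 << m) - 1
    b &= (1 << m) - 1
    product = 0
    for i in range(m):                 # carry-less multiplication
        if (b >> i) & 1:
            product ^= a << i
    for bit in range(2 * m - 2, m - 1, -1):   # reduce modulo the generator
        if (product >> bit) & 1:
            product ^= full_poly << (bit - m)
    return product


def gmpy4(src1, src2, size=7, poly=0x1D):
    """Apply gf_mul independently to the four bytes of two 32-bit words."""
    result = 0
    for lane in range(4):
        a = (src1 >> (8 * lane)) & 0xFF
        b = (src2 >> (8 * lane)) & 0xFF
        result |= gf_mul(a, b, size, poly) << (8 * lane)
    return result
```

Under these assumptions, multiplying 0x02 by 0x80 in every lane overflows the 8-bit field once and reduces to 0x1D, so `gmpy4(0x02020202, 0x80808080)` yields `0x1D1D1D1D`.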

[0070] A general-purpose input register (IN) depicted in Figure 5 supports 32 general-purpose input signals, while 
a general-purpose output register (OUT) depicted in Figure 6 supports 32 general-purpose output signals. The function 
of these signals is described later herein. 
[0071] Table 9 below explains various symbols used herein. 



Table 9. Instruction Operation and Execution Notations

Symbol          Meaning
long            40-bit register value
+a              Perform twos-complement addition using the addressing mode defined by the AMR
-a              Perform twos-complement subtraction using the addressing mode defined by the AMR
xor             Bitwise exclusive OR
not             Bitwise logical complement
by..z           Selection of bits y through z of bit string b
>>s             Shift right with sign extension
>>z             Shift right with a zero fill
x clear b,e     Clear a field in x, specified by b (beginning bit) and e (ending bit)
x exts l,r      Extract and sign-extend a field in x, specified by l (shift left value) and r (shift right value)
x extu l,r      Extract an unsigned field in x, specified by l (shift left value) and r (shift right value)
+s              Perform twos-complement addition and saturate the result to the result size, if an overflow or underflow occurs
-s              Perform twos-complement subtraction and saturate the result to the result size, if an overflow or underflow occurs
x set b,e       Set field in x to all 1s, specified by b (beginning bit) and e (ending bit)
lmb0(x)         Leftmost 0 bit search of x
lmb1(x)         Leftmost 1 bit search of x
norm(x)         Leftmost nonredundant sign bit of x
abs(x)          Absolute value of x
and             Bitwise AND
bi              Select bit i of source/destination b
bit_count       Count the number of bits that are 1 in a specified byte
bit_reverse     Reverse the order of bits in a 32-bit register
byte0           8-bit value in the least significant byte position in 32-bit register (bits 0-7)
byte1           8-bit value in the next to least significant byte position in 32-bit register (bits 8-15)
byte2           8-bit value in the next to most significant byte position in 32-bit register (bits 16-23)
byte3           8-bit value in the most significant byte position in 32-bit register (bits 24-31)
bv2             Bit vector of two flags for s2 or u2 data type
bv4             Bit vector of four flags for s4 or u4 data type
cond            Check for either creg equal to 0 or creg not equal to 0
creg            3-bit field specifying a conditional register
cstn            n-bit constant field (for example, cst5)
dst_h or dst_o  msb32 of dst (placed in odd register of 64-bit register pair)
dst_l or dst_e  lsb32 of dst (placed in even register of a 64-bit register pair)
dws4            Four packed signed 16-bit integers in a 64-bit register pair
dwu4            Four packed unsigned 16-bit integers in a 64-bit register pair
gmpy            Galois Field Multiply
i2              Two packed 16-bit integers in a single 32-bit register
i4              Four packed 8-bit integers in a single 32-bit register
int             32-bit integer value
lsbn or LSBn    n least significant bits (for example, lsb16)
msbn or MSBn    n most significant bits (for example, msb16)
nop             No operation
or              Bitwise OR
R               Any general-purpose register
rotl            Rotate left
sat             Saturate
sbyte0          Signed 8-bit value in the least significant byte position in 32-bit register (bits 0-7)
sbyte1          Signed 8-bit value in the next to least significant byte position in 32-bit register (bits 8-15)
sbyte2          Signed 8-bit value in the next to most significant byte position in 32-bit register (bits 16-23)
sbyte3          Signed 8-bit value in the most significant byte position in 32-bit register (bits 24-31)
scstn           Signed n-bit constant field (for example, scst7)
se              Sign-extend
sint            Signed 32-bit integer value
slsb16          Signed 16-bit integer value in lower half of 32-bit register
smsb16          Signed 16-bit integer value in upper half of 32-bit register
s2              Two packed signed 16-bit integers in a single 32-bit register
s4              Four packed signed 8-bit integers in a single 32-bit register
sllong          Signed 64-bit integer value
ubyte0          Unsigned 8-bit value in the least significant byte position in 32-bit register (bits 0-7)
ubyte1          Unsigned 8-bit value in the next to least significant byte position in 32-bit register (bits 8-15)
ubyte2          Unsigned 8-bit value in the next to most significant byte position in 32-bit register (bits 16-23)
ubyte3          Unsigned 8-bit value in the most significant byte position in 32-bit register (bits 24-31)
ucstn           n-bit unsigned constant field (for example, ucst5)
uint            Unsigned 32-bit integer value
ullong          Unsigned 64-bit integer value
ulsb16          Unsigned 16-bit integer value in lower half of 32-bit register
umsb16          Unsigned 16-bit integer value in upper half of 32-bit register
u2              Two packed unsigned 16-bit integers in a single 32-bit register
u4              Four packed unsigned 8-bit integers in a single 32-bit register
xi2             Two packed 16-bit integers in a single 32-bit register that can optionally use cross path
xi4             Four packed 8-bit integers in a single 32-bit register that can optionally use cross path
xsint           Signed 32-bit integer value that can optionally use cross path
xs2             Two packed signed 16-bit integers in a single 32-bit register that can optionally use cross path
xs4             Four packed signed 8-bit integers in a single 32-bit register that can optionally use cross path
xuint           Unsigned 32-bit integer value that can optionally use cross path
xu2             Two packed unsigned 16-bit integers in a single 32-bit register that can optionally use cross path
xu4             Four packed unsigned 8-bit integers in a single 32-bit register that can optionally use cross path
                Assignment
+               Addition
++              Increment by one
x               Multiplication
-               Subtraction
>               Greater than
<               Less than
<<              Shift left
>>              Shift right
>=              Greater than or equal to
<=              Less than or equal to
                Equal to
                Logical Inverse
&               Logical And
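
A few of the notations in Table 9 can be made concrete with a short sketch on 32-bit values. The function names are hypothetical; the behavior follows the table's descriptions of "x exts l,r", "x extu l,r" and "+s", with the result size of "+s" assumed here to be 32 bits.

```python
MASK32 = 0xFFFFFFFF
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000


def extu(x, l, r):
    """x extu l,r: shift left by l, then shift right by r with zero fill."""
    return ((x << l) & MASK32) >> r


def exts(x, l, r):
    """x exts l,r: shift left by l, then shift right by r with sign extension.
    The result is returned as an unsigned 32-bit pattern."""
    v = (x << l) & MASK32
    if v & 0x80000000:
        v -= 1 << 32          # reinterpret as a negative twos-complement value
    return (v >> r) & MASK32  # Python's >> on a negative int sign-extends


def sadd(a, b):
    """a +s b for signed 32-bit operands: returns (result, saturated)."""
    total = a + b
    if total > INT32_MAX:
        return INT32_MAX, True
    if total < INT32_MIN:
        return INT32_MIN, True
    return total, False
```

For instance, `extu(0x12345678, 8, 24)` extracts the byte 0x34, while `exts(0x12F45678, 8, 24)` extracts 0xF4 and sign-extends it to 0xFFFFFFF4.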



Table 10 defines the mapping between instructions and functional units for a set of basic instructions included in a DSP
described in U.S. Patent Serial No. 09/012,813 (TI-25311, incorporated herein by reference). Table 11 defines a
mapping between instructions and functional units for a set of extended instructions in an embodiment of the present
invention. Alternative embodiments of the present invention may have different sets of instructions and functional unit
mapping. Tables 10 and 11 are illustrative and are not exhaustive or intended to limit various embodiments of the
present invention.

Table 10. Instruction to Functional Unit Mapping of Basic Instructions

.L Unit  .M Unit  .S Unit        .D Unit
ABS      MPY      ADD            ADD
ADD      SMPY     ADDK           ADDA
AND               ADD2           LD mem
CMPEQ             AND            LD mem (15-bit offset) (D2 only)
CMPGT             B disp         MV
CMPGTU            B IRP          NEG
CMPLT             B NRP          ST mem
CMPLTU            B reg          ST mem (15-bit offset) (D2 only)
LMBD              CLR            SUB
MV                EXT            SUBA
NEG               EXTU           ZERO
NORM              MVC
NOT               MV
OR                MVK
SADD              MVKH
SAT               NEG
SSUB              NOT
SUB               OR
SUBC              SET
XOR               SHL
ZERO              SHR
                  SHRU
                  SSHL
                  STP (S2 only)
                  SUB
                  SUB2
                  XOR
                  ZERO





Table 11. Instruction to Functional Unit Mapping of Extended Instructions

.L unit  .M unit    .S unit  .D unit
ABS2     AVG2       ADD2     ADD2
ADD2     AVGU4      ADDKPC   AND
ADD4     BITC4      AND      ANDN
AND      BITR       ANDN     LDDW
ANDN     DEAL       BDEC     LDNDW
MAX2     DOTP2      BNOP     LDNW
MAXU4    DOTPN2     BPOS     MVK
MIN2     DOTPNRSU2  CMPEQ2   OR
MINU4    DOTPNRUS2  CMPEQ4   STDW
MVK      DOTPRSU2   CMPGT2   STNDW
OR       DOTPRUS2   CMPGTU4  STNW
PACK2    DOTPSU4    CMPLT2   SUB2
PACKH2   DOTPUS4    CMPLTU4  XOR
PACKH4   DOTPU4     MVK
PACKHL2  GMPY4      OR
PACKL4   MPY2       PACK2
PACKLH2  MPYHI      PACKH2
SHLMB    MPYHIR     PACKHL2
SHRMB    MPYIH      PACKLH2
SUB2     MPYIHR     SADD2
SUB4     MPYIL      SADDU4
SUBABS4  MPYILR     SADDSU2
SWAP2    MPYLI      SADDUS2
SWAP4    MPYLIR     SHLMB
UNPKHU4  MPYSU4     SHR2
UNPKLU4  MPYUS4     SHRMB
XOR      MPYU4      SHRU2
         MVD        SPACK2
         ROTL       SPACKU4
         SHFL       SUB2
         SMPY2      SWAP2
         SSHVL      UNPKHU4
         SSHVR      UNPKLU4
         XPND2      XOR
         XPND4





[0072] The DSP's opcode map is shown in Figures 8A-8J. Refer to Table 9 and the instruction descriptions later
herein for explanations of the field syntaxes and values. An instruction syntax is used to describe each instruction. The
opcode map breaks down the various bit fields that make up each instruction. There are certain instructions that can
be executed on more than one functional unit, as was shown in Table 11. The syntax specifies the functional unit and
various resources used by an instruction, typically as follows:

EXAMPLE (.unit) src, dst

[0073] The following are examples of what the syntax looks like for the ADD instruction:

1) ADD (.unit) src1, src2, dst

2) ADDU (.unit) src1, src2, dst

3) ADD (.unit) src2, src1, dst

unit = .L1, .L2, .S1, .S2, .D1, .D2

[0074] src and dst indicate source and destination respectively. The (.unit) dictates which functional unit the
instruction is mapped to (.L1, .L2, .S1, .S2, .M1, .M2, .D1, or .D2). This instruction has three opcode map fields: src1, src2,
and dst. The condition is controlled by a 3-bit (creg) field specifying a register to be tested, and a 1-bit field (z) specifying
a test for zero or nonzero, as shown in Figures 8A-8J. The four MSBs of every opcode are creg and z. The specified
register is tested at the beginning of the E1 pipeline stage for all instructions. The pipeline is described later herein. If
z = 1, the test is for equality with zero. If z = 0, the test is for nonzero. The case of condition register field (creg) = 0
and z = 0 is treated as always true to allow instructions to be executed unconditionally. The creg register field is encoded
as shown in Table 12.
Table 12. Registers That Can Be Tested by Conditional Operations

Creg (bits 31 30 29)  z (bit 28)  Register Tested
0 0 0                 0           Unconditional.
0 0 0                 1           Reserved: When selected this indicates a SWBP instruction
0 0 1                 z           B0
0 1 0                 z           B1
0 1 1                 z           B2
1 0 0                 z           A1
1 0 1                 z           A2
1 1 x                 x           Reserved

Note: x is don't care for reserved cases.
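
The condition test defined by creg and z can be sketched as a small function. The names are hypothetical, and the register file is modeled as a plain dictionary for illustration; the encodings (B0, B1, B2, A1, A2) and the meaning of z follow Table 12 and the description above.

```python
# creg encodings that select a testable register, per Table 12
CREG_TO_REGISTER = {
    0b001: "B0",
    0b010: "B1",
    0b011: "B2",
    0b100: "A1",
    0b101: "A2",
}


def condition_passes(creg, z, regs):
    """Return True if an instruction with the given creg/z fields executes.

    regs maps register names to their current values.
    """
    if creg == 0 and z == 0:
        return True                      # unconditional
    name = CREG_TO_REGISTER.get(creg)
    if name is None:
        raise ValueError("reserved creg/z encoding")
    value = regs[name]
    # z = 1 tests for equality with zero; z = 0 tests for nonzero
    return value == 0 if z == 1 else value != 0
```

For any register value, exactly one of the two calls `condition_passes(0b001, 0, regs)` and `condition_passes(0b001, 1, regs)` returns True, which is the mutually exclusive pairing written as [B0] and [!B0] in assembly syntax.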



[0075] Conditional instructions are represented by "[ ]" surrounding the condition register. The following execute
packet contains two ADD instructions in parallel. The first ADD is conditional on B0 being non-zero. The second ADD
is conditional on B0 being zero. '!' indicates the 'not' of the condition.

   [B0]  ADD .L1 A1,A2,A3

|| [!B0] ADD .L2 B1,B2,B3

[0076] The above instructions are mutually exclusive. This means that only one will execute. 

[0077] If they are scheduled in parallel, mutually exclusive instructions must still follow all resource constraints
mentioned later herein.

[0078] If mutually exclusive instructions share any resources as described later herein, they cannot be scheduled in
parallel (put in the same execute packet), even though only one will end up executing.

[0079] The execution of instructions can be defined in terms of delay slots. Table 13 shows the types of instructions,
how many delay slots each type of instruction has, and the execute phases it uses. The number of delay slots is equivalent
to the number of cycles required after the source operands are read for the result to be available for reading. For a
single-cycle type instruction (such as CMPGT2), source operands read in cycle i produce a result that can be read in
cycle i + 1. For a two-cycle instruction (such as AVGU4), source operands read in cycle i produce a result that can be
read in cycle i + 2. For a four-cycle instruction (such as DOTP2), source operands read in cycle i produce a result that
can be read in cycle i + 4. Table 13 shows the number of delay slots associated with each type of instruction.

[0080] Delay slots are equivalent to an execution or result latency. All of the additional instructions have a functional
unit latency of 1. This means that a new instruction can be started on the functional unit every cycle. Single-cycle
throughput is another term for single-cycle functional unit latency.



Table 13. Delay Slot and Functional Unit Latency Summary

Instruction Type    Delay Slots  Functional Unit Latency  Read Cycles †  Write Cycles †  Branch Taken †
NOP (no operation)  0            1
Store               0            1                        i              i
Single cycle        0            1                        i              i
Two cycle           1            1                        i              i + 1
Multiply (16x16)    1            1                        i              i + 1
Four cycle          3            1                        i              i + 3
Load                4            1                        i              i, i + 4 §
Branch              5            1                        i ‡                            i + 5

† Cycle i is in the E1 pipeline phase.
§ For loads, any address modification happens in cycle i. The loaded data is written into the register file in cycle i + 4.
‡ The branch to label, branch to IRP, and branch to NRP instructions do not read any general purpose registers.
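
The relationship between delay slots and the first cycle in which a result can be read may be expressed as a one-line rule. The names below are hypothetical; the delay slot counts are the ones given in Table 13, and the rule (a result is readable in cycle i + delay slots + 1) follows from the single-cycle, two-cycle and four-cycle examples in the text. Stores, NOPs and branches are left out because they do not produce a register result.

```python
# Delay slots per instruction type, taken from Table 13
DELAY_SLOTS = {
    "single_cycle": 0,
    "two_cycle": 1,
    "multiply_16x16": 1,
    "four_cycle": 3,
    "load": 4,
}


def first_read_cycle(kind, read_cycle):
    """Earliest cycle in which the result can be read, given that the
    source operands are read in read_cycle (cycle i in the text)."""
    return read_cycle + DELAY_SLOTS[kind] + 1
```

Because the pipeline is unprotected, a sketch like this is the sort of bookkeeping a scheduler would need in order to avoid reading a destination register before `first_read_cycle` has been reached.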



[0081] Instructions are always fetched eight at a time. This constitutes a fetch packet. The basic format of a fetch
packet is shown in Figure 9A. The execution grouping of the fetch packet is specified by the p-bit, bit zero, of each
instruction. Fetch packets are 8-word aligned.

[0082] The p bit controls the parallel execution of instructions. The p bits are scanned from left to right (lower to
higher address). If the p bit of instruction i is 1, then instruction i + 1 is to be executed in parallel with (in the same cycle
as) instruction i. If the p bit of instruction i is 0, then instruction i + 1 is executed in the cycle after instruction i. All
instructions executing in parallel constitute an execute packet. An execute packet can contain up to eight instructions.
Each instruction in an execute packet must use a different functional unit.
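
The p-bit scan described above can be sketched as a function that groups an instruction stream into execute packets. The function name and the (name, p_bit) tuple representation are hypothetical and serve only to illustrate the rule: an instruction whose p bit is 1 shares its cycle with the next instruction, and a p bit of 0 closes the current execute packet.

```python
def group_execute_packets(instructions):
    """Group (name, p_bit) pairs, given in address order, into execute packets."""
    packets = []
    current = []
    for name, p_bit in instructions:
        current.append(name)
        if p_bit == 0:          # p = 0 ends the current execute packet
            packets.append(current)
            current = []
    if current:
        # The stream ended on p = 1; the last packet continues beyond the
        # instructions supplied, so it is returned as-is (an assumption).
        packets.append(current)
    return packets
```

With p bits of 0, 0, 1, 1, 0, 1, 1, 0 for instructions A through H, the function returns four packets: A, B, then C D E, then F G H.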

[0083] Figure 16A is an illustration of fetch packets and execution packets in a prior art processor. If an execution
packet did not fit evenly within a fetch packet, NOP instructions were inserted in the instruction stream to pad out the
fetch packet. For example, in Figure 16A, execution packet E3 cannot fit in the four-word space directly after the
preceding execution packet, so NOP instructions are inserted in the instruction sequence to pad out the fetch packet.

[0084] Figure 16B is an illustration of execution packets spanning fetch packets for the processor of Figure 1.
Advantageously, in the present embodiment of processor 10, an execution packet can cross an eight-word fetch packet
boundary, thereby eliminating a need to add NOP instructions to pad fetch packets. For example, eight-word execution
packet EP1 completely occupies fetch packet 1600. Four-word execution packet EP2 partially fills fetch packet 1602.
Execution packet EP3 does not fit completely within fetch packet 1602; however, the first four words 1621-1624
are placed in fetch packet 1602 and the last two words 1625, 1626 are placed in fetch packet
1604. Therefore, the last p bit in a fetch packet is not always set to 0. If the last p bit of a fetch packet is not zero, then
instruction fetch control circuitry in stage 10a (Figure 1) fetches a second fetch packet and extracts instruction words
until a p bit set to 0 is encountered. This sequence of instruction words is then ordered into a single execution packet,
such as execution packet EP3, for example.

[0085] The following examples illustrate the conversion of a p-bit sequence into a cycle-by-cycle execution stream
of instructions. There are three types of p-bit patterns for fetch packets. These three p-bit patterns result in the following
execution sequences for the eight instructions: fully serial; fully parallel; or partially serial. These three sequences of
execution are explained more fully below.

[0086] The fully serial p-bit pattern depicted in Figure 9B results in the execution sequence illustrated in Table 14.

Table 14. Fully Serial p-bit Pattern Execution Sequence

Cycle  Instructions
1      A
2      B
3      C
4      D
5      E
6      F
7      G
8      H



The eight instructions are executed sequentially. 

[0087] The fully parallel p-bit pattern depicted in Figure 9C results in the execution sequence illustrated in Table 15.



Table 15. Fully Parallel p-bit Pattern Execution Sequence

Cycle  Instructions
1      A  B  C  D  E  F  G  H



All eight instructions are executed in parallel. 

[0088] The partially serial p-bit pattern depicted in Figure 9D results in the execution sequence illustrated in Table 16.

Table 16. Partially Serial p-bit Pattern Execution Sequence

Cycle  Instructions
1      A
2      B
3      C  D  E
4      F  G  H



Note that the instructions C, D, and E do not use any of the same functional units, cross paths, or other data path
resources. This is also true for instructions F, G, and H.

[0089] The || characters signify that an instruction is to execute in parallel with the previous instruction. In the previous
partially serial example, the code would be represented as this:

      instruction A
      instruction B
      instruction C
   || instruction D
   || instruction E
      instruction F
   || instruction G
   || instruction H



[0090] If a branch into the middle of an execution packet occurs, all instructions at lower addresses are ignored. In 
the partially serial example, if a branch to the address containing instruction D occurs, then only D and E will execute. 
Even though instruction C is in the same execute packet, it is ignored. Instructions A and B are also ignored because 
they are in earlier execute packets. 
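
The effect of branching into the middle of an execute packet can be illustrated with a short sketch. The function name and the (name, p_bit) representation are hypothetical; the behavior modeled is only what the paragraph above states, namely that instructions at addresses lower than the branch target are ignored, including those in the same execute packet.

```python
def packets_from_branch_target(instructions, target):
    """Return the execute packets that run after a branch to `target`.

    instructions is a list of (name, p_bit) pairs in address order.
    Everything at a lower address than the target is skipped.
    """
    names = [name for name, _ in instructions]
    start = names.index(target)
    packets = []
    current = []
    for name, p_bit in instructions[start:]:
        current.append(name)
        if p_bit == 0:          # p = 0 closes the execute packet
            packets.append(current)
            current = []
    return packets
```

For the partially serial example, a branch to D yields the packets D E and F G H: instruction C is dropped even though it belongs to the same execute packet as D.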

Addressing Modes 

[0091] The addressing modes are linear, circular using BK0, and circular using BK1. The mode is specified by the
addressing mode register (AMR).

[0092] Eight registers can perform circular addressing. A4-A7 are used by the .D1 unit and B4-B7 are used by the
.D2 unit. No other units can perform circular addressing modes. For each of these registers, the AMR specifies the
addressing mode.

[0093] The following instructions all use the AMR to determine what type of address calculations are performed for
these registers: LDB(U)/LDH(U)/LDW, STB/STH/STW, LDNDW, LDNW, STNDW, STNW, LDDW, STDW, ADDAB/
ADDAH/ADDAW/ADDAD, and SUBAB/SUBAH/SUBAW. There is no SUBAD instruction. All registers can perform
linear mode addressing.

[0094] Linear mode addressing operates as follows with LD/ST instructions: linear mode simply shifts the offsetR/cst
operand to the left by 3, 2, 1, or 0 for double word, word, half-word, or byte access respectively and then performs
an add or subtract to baseR (depending on the operation specified). For the preincrement, predecrement, positive
offset, and negative offset address generation options, the result of the calculation is the address to be accessed in
memory. For post-increment or post-decrement addressing, the value of baseR before the addition or subtraction is the
address to be accessed from memory. Address modes are specified by a mode field (see Figure 8E) and listed in Table
17. The increment/decrement mode controls whether the updated address is written back to the register file. Otherwise
it is rather similar to offset mode. The pre-increment and offset modes differ only in whether the result is written back
to "base". The post-increment mode is similar to pre-increment (i.e. the new address is written to "base") but differs
in that the old value of "base" is used as the address for the access. The same applies for negative offset vs. decrement
mode.

[0095] Linear mode addressing operates as follows with ADDA/SUBA instructions: linear mode simply shifts the src1/cst
operand to the left by 3, 2, 1, or 0 for double word, word, halfword, or byte access respectively and then performs an
add or subtract (depending on the operation specified).



Table 17. Address Generator Options

Mode Field   Syntax            Modification Performed
0101         *+R[offsetR]      Positive offset; addr = base + offset * scale
0100         *-R[offsetR]      Negative offset; addr = base - offset * scale
1101         *++R[offsetR]     Preincrement; addr = base + offset * scale; base = addr
1100         *--R[offsetR]     Predecrement; addr = base - offset * scale; base = addr
1111         *R++[offsetR]     Postincrement; addr = base; base = base + offset * scale
1110         *R--[offsetR]     Postdecrement; addr = base; base = base - offset * scale
0001         *+R[ucst5]        Positive offset; addr = base + offset * scale
0000         *-R[ucst5]        Negative offset; addr = base - offset * scale
1001         *++R[ucst5]       Preincrement; addr = base + offset * scale; base = addr
1000         *--R[ucst5]       Predecrement; addr = base - offset * scale; base = addr
1011         *R++[ucst5]       Postincrement; addr = base; base = base + offset * scale
1010         *R--[ucst5]       Postdecrement; addr = base; base = base - offset * scale
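
The six address generation options in Table 17 can be sketched as a small Python function. This is an illustrative model only: the function name, the string keys, and the 32-bit wrap of the result are assumptions, and `scale` is the access size in bytes (8, 4, 2, or 1, equivalent to a left shift of 3, 2, 1, or 0).

```python
MODES = {
    # key: (sign, access uses updated address?, write result back to base?)
    "*+R":  (+1, True,  False),   # positive offset
    "*-R":  (-1, True,  False),   # negative offset
    "*++R": (+1, True,  True),    # preincrement
    "*--R": (-1, True,  True),    # predecrement
    "*R++": (+1, False, True),    # postincrement
    "*R--": (-1, False, True),    # postdecrement
}


def linear_address(mode, base, offset, scale):
    """Return (address accessed, new value of base) in linear mode."""
    sign, use_updated, write_back = MODES[mode]
    updated = (base + sign * offset * scale) & 0xFFFFFFFF  # assumed 32-bit wrap
    addr = updated if use_updated else base
    new_base = updated if write_back else base
    return addr, new_base


print([hex(v) for v in linear_address("*++R", 0x100, 9, 4)])
print([hex(v) for v in linear_address("*R--", 0x100, 1, 2)])
```

The three booleans make the distinctions in the text explicit: offset and pre-modify modes differ only in write-back, while post-modify modes write back but access the old base.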



[0096] Circular mode addressing uses the BK0 and BK1 fields in the AMR to specify block sizes for circular addressing.
Circular mode addressing operates as follows with LD/ST instructions: offsetR/cst is shifted to the left by 3, 2,
1, or 0 for LDDW, LDW, LDH, or LDB respectively, and is then added to or subtracted from baseR to produce the final
address. This add or subtract is performed by only allowing bits N through 0 of the result to be updated, leaving bits
31 through N+1 unchanged after address arithmetic. The resulting address is bounded to a 2^(N+1) range regardless
of the size of the offsetR/cst.

[0097] As mentioned before, the circular buffer size in the AMR is not scaled; for example, a size of 8 is 8 bytes, not
8 x size of (type). So, to perform circular addressing on an array of 8 words, a size of 32 should be specified, or N =
4. Table 18 shows an LDW performed with register A4 in circular mode, with BK0 = 4, so the buffer size is 32 bytes,
16 halfwords, or 8 words. The value put in the AMR for this example is 0004 0001h.

Table 18. LDW in Circular Mode

LDW .D1 *++A4[9],A1

            Before LDW     1 cycle after LDW    5 cycles after LDW
A4          0000 0100h     0000 0104h           0000 0104h
A1          XXXX XXXXh     XXXX XXXXh           1234 5678h
mem 104h    1234 5678h     1234 5678h           1234 5678h

Note: 9h words is 24h bytes. 24h bytes is 4 bytes beyond the
32-byte (20h) boundary 100h-11Fh; thus, it is wrapped around to
(124h - 20h = 104h).
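
The wrap-around in Table 18 follows directly from updating only bits N through 0 of the address. The sketch below is a hedged Python model of that rule; the function name and argument order are hypothetical, and `scale` is the access size in bytes.

```python
def circular_add(base, offset, scale, n, subtract=False):
    """Add or subtract offset*scale, letting only bits N..0 of base change."""
    mask = (1 << (n + 1)) - 1            # low N+1 bits: the circular block
    delta = offset * scale
    total = base - delta if subtract else base + delta
    return (base & ~mask) | (total & mask)


# Table 18: A4 = 100h, offset of 9 words, BK0 = 4 (32-byte block).
print(hex(circular_add(0x100, 9, 4, 4)))                   # 0x104
# Stepping back one word from the start of the block wraps to its end.
print(hex(circular_add(0x100, 1, 4, 4, subtract=True)))    # 0x11c
```

Because bits 31 through N+1 are copied from the old base, the result can never leave the 2^(N+1)-byte block, whatever the size of the offset.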

[0098] Circular mode addressing operates as follows with ADDA/SUBA instructions: src1/cst is shifted to the left
by 3, 2, 1, or 0 for ADDAD, ADDAW, ADDAH, or ADDAB respectively, and is then added to or subtracted from baseR
to produce the final address. This add or subtract is performed by only allowing bits N through 0 of the result to be
updated, leaving bits 31 through N+1 unchanged after address arithmetic. The resulting address is bounded to a 2^(N+1)
range, regardless of the size of the offsetR/cst.

[0099] As mentioned before, the circular buffer size in the AMR is not scaled; for example, a size of 4 is 4 bytes, not 4
x size of (type). So, to perform circular addressing on an array of 8 words, a size of 32 should be specified, or N = 4.
Table 19 shows an ADDAH performed with register A4 in circular mode, with BK0 = 4, so the buffer size is 32 bytes,
16 halfwords, or 8 words. The value put in the AMR for this example is 0004 0001h.



Table 19. ADDAH in Circular Mode

ADDAH .D1 A4,A1,A4

        Before ADDAH    1 cycle after ADDAH
A4      0000 0100h      0000 0106h
A1      0000 0013h      0000 0013h

Note: 13h halfwords is 26h bytes. 26h bytes is 6 bytes beyond the 32-byte (20h) boundary 100h-11Fh; thus, it is wrapped
around to (126h - 20h = 106h).
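
The AMR value 0004 0001h quoted for these examples can be cross-checked with a short Python sketch. The field layout used here is an assumption that this passage does not spell out: a 2-bit mode per register in bits 15..0 (ordered A4, A5, A6, A7, B4, B5, B6, B7), BK0 in bits 20..16, and BK1 in bits 25..21. It is consistent with the quoted value, but the helper names are hypothetical.

```python
# Assumed mode encodings: 00 = linear, 01 = circular using BK0, 10 = circular using BK1.
MODE_LINEAR, MODE_CIRC_BK0, MODE_CIRC_BK1 = 0b00, 0b01, 0b10
REGS = ["A4", "A5", "A6", "A7", "B4", "B5", "B6", "B7"]


def amr_value(modes, bk0=0, bk1=0):
    """Pack per-register modes and the two block-size fields into one word."""
    value = (bk1 << 21) | (bk0 << 16)
    for reg, mode in modes.items():
        value |= mode << (2 * REGS.index(reg))
    return value


def block_size_bytes(n):
    """Block size selected by a BK field holding N: 2^(N+1) bytes, unscaled."""
    return 1 << (n + 1)


print(f"{amr_value({'A4': MODE_CIRC_BK0}, bk0=4):08X}")   # 00040001
print(block_size_bytes(4))                                 # 32
```

With N = 4 the block is 32 bytes regardless of whether the accesses are bytes, halfwords, or words, which is the "not scaled" point made above.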

Non-Aligned Memory Access Considerations

[0100] Circular addressing may be used with non-aligned accesses. When circular addressing is enabled, address 
updates and memory accesses occur in the same manner as for the equivalent sequence of byte accesses. The only 
restriction is that the circular buffer size be at least as large as the data size being accessed. Non-aligned access to 
circular buffers that are smaller than the data being read will cause undefined results. 

[0101] Non-aligned accesses to a circular buffer apply the circular addressing calculation to logically adjacent memory
addresses. The result is that non-aligned accesses near the boundary of a circular buffer will correctly read data
from both ends of the circular buffer, thus seamlessly causing the circular buffer to "wrap around" at the edges.

[0102] Figures 10A and 10B illustrate aspects of non-aligned address formation and non-aligned data extraction
from a circular buffer region, according to an aspect of the present invention. Consider, for example, a circular buffer
500 that has a size of 16 bytes, illustrated in Figure 10A. A circular buffer of this size is specified by setting either BK0
or BK1 to "00011". For example, with register A4 in circular mode and BK0 = 3, the buffer size is 16 bytes, 8 half words,
or 4 words. The value put in the AMR for this example is 0003 0001h. The buffer starts at address 0x0020 (502) and
ends at 0x002F (504). The register A4 is initialized to the address 0x0028, for example; however, the buffer could be
located at other places in the memory by setting more significant address bits in register A4. Below the buffer at address
0x1F (506) and above the buffer at address 0x30 (508), data can be stored that is not relevant to the buffer.

[0103] The effect of circular buffering is to make it so that memory accesses and address updates in the 0x20 - 0x2F
range stay completely inside this range. Effectively, the memory map behaves as illustrated in Figure 10B. Executing a
LDW instruction with an offset of 1 in post-increment mode will provide an address of 0x0028 (511) and access word
510, for example. Executing the instruction a second time will provide an address of 0x002C (513) and access word
512 at the end of the circular buffer. Executing the instruction a third time will provide an address of 0x0020 (502a) and
access word 514. Note that word 514 actually corresponds to the other end of the circular buffer, but was accessed
by incrementing the address provided by the LDW instruction.
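
The three-access sequence just described can be reproduced with a short Python sketch of post-increment addressing in circular mode. The function name is hypothetical; `n` is the value in the BK field and `scale` is the access size in bytes.

```python
def post_increment_walk(start, offset, scale, n, count):
    """Addresses used by `count` post-increment accesses in circular mode."""
    mask = (1 << (n + 1)) - 1
    addr, seen = start, []
    for _ in range(count):
        seen.append(addr)   # post-increment: the old value is the access address
        addr = (addr & ~mask) | ((addr + offset * scale) & mask)
    return seen


# A4 = 0x0028, LDW with an offset of 1 word, 16-byte buffer (N = 3).
print([hex(a) for a in post_increment_walk(0x28, 1, 4, 3, 3)])
# ['0x28', '0x2c', '0x20']
```

The third address falls back to the start of the buffer even though the program only ever incremented the pointer.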

[0104] Table 20 shows an LDNW performed with register A4 in circular mode and BK0 = 3, so the buffer size is 16
bytes, 8 half words, or 4 words. The value put in the AMR for this example is 0003 0001h. The buffer starts at address
0x0020 and ends at 0x002F. The register A4 is initialized to the address 0x002A.



Table 20. LDNW in Circular Mode

LDNW .D1 *++A4[2],A1

             Before LDNW    1 cycle after LDNW   5 cycles after LDNW
A4           0000 002Ah     0000 0022h           0000 0022h
A1           XXXX XXXXh     XXXX XXXXh           5678 9ABCh
mem 0022h    5678 9ABCh     5678 9ABCh           5678 9ABCh

Note: 2h words is 8h bytes. 8h bytes beyond 002Ah is 0032h, which is past the end of the 16-byte (10h) buffer
0020h-002Fh; thus, it is wrapped around to 0022h (0032h - 10h = 0022h).
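
Since a non-aligned access in circular mode behaves like the equivalent sequence of byte accesses, the byte addresses it touches can be listed with a small Python sketch. This is an illustrative model; the function name is hypothetical and the error raised for an undersized buffer merely stands in for the "undefined results" case.

```python
def circular_byte_addresses(addr, nbytes, n):
    """Byte addresses touched by an nbytes-wide access starting at addr."""
    mask = (1 << (n + 1)) - 1
    if nbytes > mask + 1:
        raise ValueError("buffer smaller than the access: results undefined")
    # Each successive byte address wraps inside the 2^(N+1)-byte block.
    return [(addr & ~mask) | ((addr + i) & mask) for i in range(nbytes)]


# Table 20: LDNW at 0x0022 in the 16-byte buffer 0x0020-0x002F (N = 3).
print([hex(a) for a in circular_byte_addresses(0x22, 4, 3)])
# A word access starting two bytes before the end reads from both ends.
print([hex(a) for a in circular_byte_addresses(0x2E, 4, 3)])
# ['0x2e', '0x2f', '0x20', '0x21']
```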



Pipeline Operation 



[0105] The DSP pipeline has several key features which improve performance, decrease cost, and simplify programming.
They are: increased pipelining eliminates traditional architectural bottlenecks in program fetch, data access, and
multiply operations; control of the pipeline is simplified by eliminating pipeline interlocks; the pipeline can dispatch eight
parallel instructions every cycle; parallel instructions proceed simultaneously through the same pipeline phases;
sequential instructions proceed with the same relative pipeline phase difference; and load and store addresses appear
on the CPU boundary during the same pipeline phase, eliminating read-after-write memory conflicts.
[0106] A multi-stage memory pipeline is present for both data accesses and program fetches. This allows use of 
high-speed synchronous memories both on-chip and off-chip, and allows infinitely nestable zero-overhead looping with 
branches in parallel with other instructions. 

[0107] There are no internal interlocks in the execution cycles of the pipeline, so a new execute packet enters
execution every CPU cycle. Therefore, the number of CPU cycles for a particular algorithm with particular input data is
fixed. If during program execution there are no memory stalls, the number of CPU cycles equals the number of clock
cycles for a program to execute.

[0108] Performance can be inhibited by stalls from the memory system, stalls for cross path dependencies, or
interrupts. The reasons for memory stalls are determined by the memory architecture. Cross path stalls are described in
detail in U.S. Patent Serial No. (TI-30563), to Steiss, et al., which is incorporated herein by reference.
To fully understand how to optimize a program for speed, the sequence of program fetch, data store, and data load
requests the program makes, and how they might stall the CPU, should be understood.

[0109] The pipeline operation, from a functional point of view, is based on CPU cycles. A CPU cycle is the period 
during which a particular execute packet is in a particular pipeline stage. CPU cycle boundaries always occur at clock 
cycle boundaries; however, stalls can cause CPU cycles to extend over multiple clock cycles. To understand the ma- 
chine state at CPU cycle boundaries, one must be concerned only with the execution phases (E1-E5) of the pipeline. 
The phases of the pipeline are shown in Figure 11 and described in Table 21. 



Table 21. Pipeline Phase Description

Pipeline        Pipeline Phase            Symbol  During This Phase                                  Instruction Types
                                                                                                     Completed
Program Fetch   Program Address Generate  PG      Address of the fetch packet is determined.
                Program Address Send      PS      Address of fetch packet is sent to memory.
                Program Wait              PW      Program memory access is performed.
                Program Data Receive      PR      Fetch packet is expected at CPU boundary.
Program Decode  Dispatch                  DP      Next execute packet in fetch packet determined
                                                  and sent to the appropriate functional units to
                                                  be decoded.
                Decode                    DC      Instructions are decoded at functional units.
Execute         Execute 1                 E1      For all instruction types, conditions for          Single-cycle
                                                  instructions are evaluated and operands read.
                                                  Load and store instructions: address generation
                                                  is computed and address modifications written
                                                  to register file.†
                                                  Branch instructions: affects branch fetch packet
                                                  in PG phase.†
                                                  Single-cycle instructions: results are written
                                                  to a register file.†
                Execute 2                 E2      Load instructions: address is sent to memory.†     Stores, STP,
                                                  Store instructions and STP: address and data       Multiplies
                                                  are sent to memory.†
                                                  Single-cycle instructions that saturate results
                                                  set the SAT bit in the Control Status Register
                                                  (CSR) if saturation occurs.†
                                                  Multiply instructions: results are written to a
                                                  register file.†
                Execute 3                 E3      Data memory accesses are performed. Any multiply
                                                  instruction that saturates results sets the SAT
                                                  bit in the Control Status Register (CSR) if
                                                  saturation occurs.†
                Execute 4                 E4      Load instructions: data is brought to CPU
                                                  boundary.†
                Execute 5                 E5      Load instructions: data is loaded into register.†  Loads

† This assumes that the conditions for the instructions are evaluated as true. If the condition is evaluated as false,
the instruction will not write any results or have any pipeline operation after E1.



[0110] Referring again to Figure 11 and Figure 1, the instruction execution pipeline of processor 10 involves a number
of discrete stages, generally demarcated by temporary latches or registers to pass the results of one stage to the next.
Instruction pipeline phases PG, PS, PW, and PR all involve instruction fetching and are embodied in program fetch
circuit 10 in association with program memory subsystem 23. Pipeline phases DP and DC involve instruction decoding;
phase DP is embodied in dispatch circuitry 10b, while pipeline phase DC is embodied in decode circuitry 10c. The
execution phases E1-E5 are embodied in stages within each functional unit L, S, M and D. For example,
the D units embody all five execute stages in association with memory subsystem 22. Other functional units do
not embody all five execution phases, but only what is required for the instruction types that are executed by a particular
functional unit.

[0111] As discussed earlier, the execution of instructions can be defined in terms of delay slots as shown in Table
22. A delay slot is a CPU cycle that occurs after the first execution phase (E1) of an instruction in which results from
the instruction are not available. For example, a multiply instruction has 1 delay slot; this means that there is 1 CPU
cycle before another instruction can use the results from the multiply instruction.



Table 22. Delay Slot Summary

Instruction Type                               Delay Slots   Execute Stages Used
Branch (The cycle when the target enters E1)   5             E1-branch target E1
Load (LD) (Incoming Data)                      4             E1-E5
Load (LD) (Address Modification)               0             E1
Multiply                                       1             E1-E2
Single-cycle                                   0             E1
Store                                          0             E1
NOP (no execution pipeline operation)          -             -
STP (no CPU internal results written)          -             -







[0112] Single-cycle instructions execute during the E1 phase of the pipeline. The operand is read, the operation is
performed, and the results are written to a register, all during E1. These instructions have no delay slots.

[0113] Multiply instructions complete their operations during the E2 phase of the pipeline. In the E1 phase, the
operand is read and the multiply begins. In the E2 phase, the multiply finishes, and the result is written to the destination
(dst) register. Multiply instructions have 1 delay slot.

[0114] Load instructions have two results: data loaded from memory and address pointer modification.

[0115] Data loads complete their operations during the E5 phase of the pipeline. In the E1 phase, the address of the
data is computed. In the E2 phase, the data address is sent to data memory. In the E3 phase, a memory read is
performed. In the E4 stage, the data is received at the CPU core boundary. Finally, in the E5 phase, the data is loaded
into a register. Because data is not written to the register until E5, these instructions have 4 delay slots. Because pointer
results are written to the register in E1, there are no delay slots associated with the address modification.
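
The relationship between delay slots, the phase in which a result is written, and the first cycle in which a dependent instruction can use it can be sketched in Python. This is a minimal model assuming no stalls; the dictionary keys and function names are hypothetical, and the delay-slot counts are those given in Table 22.

```python
DELAY_SLOTS = {
    "single-cycle": 0,      # result written in E1
    "multiply": 1,          # result written in E2
    "load-data": 4,         # data written to the register in E5
    "load-address": 0,      # pointer update written in E1
}


def write_phase(kind):
    """Execute phase in which the result is written to the register file."""
    return f"E{DELAY_SLOTS[kind] + 1}"


def first_use_cycle(kind, e1_cycle):
    """Earliest CPU cycle in which a later instruction can read the result."""
    return e1_cycle + DELAY_SLOTS[kind] + 1


# A load whose E1 phase is in cycle 1 delivers its data to instructions
# whose E1 phase is in cycle 6 or later.
print(write_phase("load-data"), first_use_cycle("load-data", 1))   # E5 6
print(write_phase("multiply"), first_use_cycle("multiply", 1))     # E2 3
```

Because the pipeline has no interlocks, an instruction scheduled earlier than `first_use_cycle` would simply read the old register contents.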

[0116] Store instructions complete their operations during the E3 phase of the pipeline. In the E1 phase, the address
of the data is computed. In the E2 phase, the data address is sent to data memory. In the E3 phase, a memory write
is performed. The address modification is performed in the E1 stage of the pipeline. Even though stores finish their
execution in the E3 phase of the pipeline, they have no delay slots and follow these rules:

1) When a load is executed before a store, the old value is loaded and the new value is stored.

2) When a store is executed before a load, the new value is stored and the new value is loaded.

3) When the instructions are in parallel, the old value is loaded and the new value is stored.
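
The three ordering rules can be reproduced by a toy Python model of a single memory location. The model is an assumption made for illustration: loads and stores are taken to reach memory in issue order (both access memory in E3), and when a load and a store are issued in the same cycle the load is taken to read before the store writes. The function name and tuple format are hypothetical.

```python
def run(ops, initial):
    """Simulate loads and stores to one memory location.

    ops -- list of (issue_cycle, "load" or "store", value_to_store_or_None)
    Returns (list of loaded values, final memory value).
    """
    mem, loaded = initial, []
    # Sort by issue cycle; within one cycle the load is handled first.
    for _, kind, value in sorted(ops, key=lambda op: (op[0], op[1] != "load")):
        if kind == "load":
            loaded.append(mem)
        else:
            mem = value
    return loaded, mem


OLD, NEW = 5, 9
print(run([(1, "load", None), (2, "store", NEW)], OLD))   # ([5], 9)  rule 1
print(run([(1, "store", NEW), (2, "load", None)], OLD))   # ([9], 9)  rule 2
print(run([(1, "store", NEW), (1, "load", None)], OLD))   # ([5], 9)  rule 3
```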

[0117] Branch instructions execute during the E1 phase of the pipeline, five delay slots/CPU cycles after the branch
instruction enters an initial E1 phase of the pipeline. Figure 12 shows the branch instruction phases. Figure 13 shows
the operation of the pipeline based on clock cycles and fetch packets. In Figure 13, if a branch is in fetch packet n,
then the E1 phase of the branch is the PG phase of n+6. In cycle 7, n is in the E1 phase and n+6 is in the PG phase.
Because the branch target is in PG on cycle 7, it will not reach E1 until cycle 13. Thus, it appears as if the branch takes
six cycles to execute, or has five delay slots.
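
The cycle numbers in this paragraph follow from the order of the pipeline phases, as the Python sketch below shows. It assumes no stalls and one execute packet per fetch packet; the names are hypothetical.

```python
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC", "E1"]


def phase_cycle(pg_cycle, phase):
    """Cycle in which a fetch packet that enters PG in pg_cycle reaches phase."""
    return pg_cycle + PHASES.index(phase)


branch_e1 = phase_cycle(1, "E1")            # fetch packet n enters PG in cycle 1
target_pg = branch_e1                       # the target is fetched as the branch executes
target_e1 = phase_cycle(target_pg, "E1")
delay_slots = target_e1 - branch_e1 - 1

print(branch_e1, target_e1, delay_slots)    # 7 13 5
```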

[0118] In Figure 14, fetch packet n, which contains three execute packets, is shown followed by six fetch packets
(n+1 through n+6), each with one execution packet (containing 8 parallel instructions). The first fetch packet (n) goes
through the program fetch phases during cycles 1-4. During these cycles a program fetch phase is started for each of
the following fetch packets.

[0119] In cycle 5, the program dispatch (DP) phase, the CPU scans the p bits and detects that there are three execute
packets (k thru k+2) in fetch packet n. This forces the pipeline to stall, which allows the DP phase to start execute
packets k+1 and k+2 in cycles 6 and 7. Once execute packet k+2 is ready to move on to the DC phase (cycle 8) the
pipeline stall is released.

[0120] The fetch packets n+1 through n+4 were all stalled so the CPU would have time to perform the DP phase for
each of the three execute packets (k thru k+2) in fetch packet n. Fetch packet n+5 was also stalled in cycles 6 and 7;
it was not allowed to enter the PG phase until after the pipeline stall was released in cycle 8. The pipeline will continue
as shown with fetch packets n+5 and n+6 until another fetch packet containing multiple execution packets enters the
DP phase, or an interrupt occurs.

[0121] Pipeline discontinuities are caused by memory stalls, multi-cycle NOPs, and the STP instruction. During a
memory stall, the CPU cycle (which normally occurs during one clock cycle) occurs in two or more cycles. All of the
pipeline phases stall during these extra clock cycles. The results of the program execution are identical with or without
the stall. With a memory stall, execution will take more clock cycles to finish.

[0122] The NOP count instruction provides count cycles of NOPs. If count ≥ 2, the NOP is a multi-cycle NOP. A NOP
2, for example, fills in extra delay slots for the instructions in the execute packet it is contained in and for all previous
execute packets. Thus, if a NOP 2 is in parallel with an MPY instruction, the MPY's results will be available for use by
instructions in the next execute packet. If the delay slots of a branch complete while a multi-cycle NOP is still dispatching
NOPs into the pipeline, the branch overrides the multi-cycle NOP and the branch target begins execution after 5 delay
slots.

[0123] STP is an advanced instruction that can be used only when both these conditions are met: 1) it cannot contain
a parallel branch instruction that would force a program fetch, and 2) no program fetch is happening because either its
second delay slot contains a multi-cycle NOP or its third and fourth delay slot execute packets are in the same fetch
packet.



Memory System 



[0124] DSP program memory system 23 includes 64K bytes of memory and a memory/cache controller. The program
memory can operate as either a 64K byte internal program memory or as a direct mapped program cache. There are
four modes under which the program memory system operates: program memory mode; cache enable mode; cache
freeze mode; and cache bypass mode. The mode under which the program memory operates is determined by the
program cache control (PCC) field (bits 5-7) in the CSR (Figure 4). Table 23 shows various PCC values for configuring
program memory system 23.



Table 23. Program and Data Cache Fields

Bit Field   Program Cache   Data Cache      Mode Name      Description
Counter     Control (PCC)   Control (DCC)
000         PCM             DCM             Cache mapped   Cache mapped into memory. Cache not updated.
                                                           Cache not accessed (except as mapped into
                                                           memory). Tags can either be invalidated
                                                           automatically or invalidated by preceding or
                                                           following flush (depending on cache
                                                           implementation).
001         PCFL            DCFL            Cache flush    Entire cache invalidated. Cache bypassed for
                                                           accesses. Cache not updated on accesses.
010         PCE             DCE             Cache enable   Cache accessed. Cache updated.
011         PCFR            DCFR            Cache freeze   Cache accessed on reads. Cache not updated on
                                                           reads. Whether the cache is updated or
                                                           invalidated on writes depends on the specific
                                                           device.
100         PCB             DCB             Cache bypass   Cache not accessed. Cache not updated on
                                                           reads. Write action is the same as for cache
                                                           freeze.
101         Reserved        Reserved
110         Reserved        Reserved
111         Reserved        Reserved







[0125] When the PCC field of the CSR contains the value 000b, the program memory is mapped as valid program 
memory space. The addresses that make up the program memory map depend on the value on the MAP_BOOT pin 
on the device. 
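
Reading the program cache mode out of a CSR value can be sketched as follows. The bit position (bits 5-7) and the mode names come from the text and Table 23; the function name and the dictionary are hypothetical, and the sketch does not model any other CSR field.

```python
PCC_MODES = {
    0b000: "cache mapped",
    0b001: "cache flush",
    0b010: "cache enable",
    0b011: "cache freeze",
    0b100: "cache bypass",
}


def pcc_mode(csr):
    """Decode the PCC field held in CSR bits 5-7."""
    field = (csr >> 5) & 0b111
    return PCC_MODES.get(field, "reserved")


print(pcc_mode(0x00000000))        # cache mapped
print(pcc_mode(0b010 << 5))        # cache enable
```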


Emulation Features 

[0126] An aspect of the present invention involves novel and improved techniques for emulating the operation of
DSP 1 in order to develop software programs or to test DSP 1 for correct operation. Portions of DSP 1 which relate to
emulation will now be described in more detail.

[0127] Referring again to Figure 1, CPU 10 has emulation circuitry 50 and interrupt circuitry 90 to support the following
emulation functions which will be described in more detail: execution and scan control through the test ports; analysis
support; and real-time emulation support.

[0128] Execution and scan control through the test ports involves halting CPU 10. CPU Halt Support is provided in
the following ways: a RDY-based CPU halt based on a software breakpoint (SWBP) or an analysis event.

[0129] Analysis support includes the following: a single, exact-match hardware program address breakpoint (PABP);
analysis events which can be triggered by the EMU0IN or EMU1IN inputs from the Megamodule Test Access Port
(MTAP), or by a program address breakpoint; and a Special Emulation Event input (SEE) which can trigger an analysis
event.

[0130] Real-time emulation support includes message passing and a CPU Analysis Interrupt (AINT) based on a
software interrupt, an analysis event, or the next cycle boundary.

[0131] Referring now to Figure 15, emulation circuitry 50 is illustrated in more detail. Megamodule Test Access Port
(MTAP) 305 is connected to CPU test port (CPUTP) 310, analysis test port (ATP) 320, and Megamodule test port (MTP)
330. Three domains, CPU domain 10, analysis domain 321, and Megamodule domain 331, are interconnected with the
test ports. MTAP 305 provides scan and execution control to the various domains within the megamodule. The test
ports provide an interface for each domain to the MTAP. Test ports also generate, distribute, and perform clock switching
functions for the functional and scan clocks on the megamodule. MTAP 305 provides an interface between XDS 51
and the CPU real-time analysis and message passing features. MTAP 305 provides data streaming for fast memory
download/upload. MTAP 305 also supports performance analysis through an event counter and test port control of
execution and clocking for both emulation and test. The operation and design of emulation circuitry 50, including MTAP
305 and test ports 310, 320 and 330, is described in greater detail in U.S. Patent Serial No. 08/974,630 (TI-24948),
Processor Test Port with Scan Chains and Data Streaming, issued to D. Deao, et al., which is incorporated herein by
reference.

[0132] Figure 17 is a block diagram of an alternative embodiment of a digital system 1700 with processor core 10
of Figure 1. A direct mapped program cache 1710, having 16 kbytes capacity, is controlled by L1 Program (L1P)
controller 1711 and connected thereby to the instruction fetch stage 10a. A 2-way set associative data cache, having
a 16 kbyte capacity, is controlled by L1 Data (L1D) controller 1721 and connected thereby to data units D1 and D2.
An L2 memory 1730 having four banks of memory, 128 Kbytes total, is connected to L1P 1711 and to L1D 1721 to
provide storage for data and programs. External memory interface (EMIF) 1750 provides a 64 bit data path to external
memory, not shown, which provides memory data to L2 memory 1730 via extended direct memory access (DMA)
controller 1740.

[0133] EMIF 1752 provides a 16 bit interface for access to external peripherals, not shown. Expansion bus 1770
provides host and I/O support similarly to host port 60/80 of Figure 1. Three multi-channel buffered serial ports (McBSP)
1760, 1762, 1764 are connected to DMA controller 1740. A detailed description of a McBSP is provided in U.S. Patent
Serial No. 09/055,011 (TI-26204, Seshan, et al.), which is incorporated herein by reference.

Detailed Description of Extended Instructions 

[0134] As discussed earlier, the set of extended instructions included in the present embodiment of the invention
increases orthogonality of the instruction set, provides data flow enhancements with 8-bit and 16-bit extension, and
provides instructions that reduce code size and increase register flexibility.

[0135] For example, there is an extensive collection of PACK and byte shift instructions to simplify manipulation of 
packed data types. A set of instructions is provided that operate directly on packed data to streamline data flow and 
increase instruction set efficiency. They are summarized in Table 24 below: 



Table 24. Instruction for Operating Directly on Packed Data

Operation                          Quad 8-bit   Paired 16-bit
Multiply                           X            X
Multiply with Saturation                        X
Addition/Subtraction               X            X
Addition with Saturation           X            X
Absolute Value                                  X
Subtract with Absolute Value       X
Compare                            X            X
Shift                                           X
Data Pack/Unpack                   X            X
Data Pack with Saturation          X            X
Dot product with optional negate   X            X
Min/Max/Average                    X            X
Bit-expansion (Mask generation)    X            X



[0136] Table 25 contains a detailed description of each extended instruction of the improved Instruction Set Archi- 
tecture (ISA). 



Table 25 - Extended Instruction Set Description 



ABS2

Absolute Value With Saturation, Signed Packed 16-bit: The absolute
values of the upper and lower halves of the src2 operand are placed
in the upper and lower halves of the dst.
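
The ABS2 behavior can be modeled in Python on a 32-bit integer holding two signed 16-bit halves. This is a sketch, not the hardware definition: it assumes that "with saturation" means the one unrepresentable result, the absolute value of -32768, is clamped to 32767 (7FFFh). The helper names are hypothetical.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def abs2(src2):
    """Absolute value of each signed 16-bit half, saturated to 7FFFh."""
    out = 0
    for shift in (0, 16):
        half = to_signed16(src2 >> shift)
        out |= min(abs(half), 0x7FFF) << shift   # assumed saturation value
    return out


print(f"{abs2(0x8000FFFF):08X}")   # 7FFF0001
```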

ADD4

Add Four 8-bit Pairs for Four 8-bit Results: The ADD4
instruction performs 2's-complement addition between packed
8-bit quantities. The values in src1 and src2 are treated
as packed 8-bit data and the results are written in packed
8-bit format.

For each pair of packed 8-bit values found in src1 and
src2, the sum between the 8-bit value from src1 and the
8-bit value from src2 is calculated to produce an 8-bit
result. The result is placed in the corresponding position
in dst. No saturation is performed. The carry from one
8-bit add does not affect any other 8-bit add.
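
A Python model of the lane-wise addition makes the "no carry between lanes" property concrete. This is an illustrative sketch operating on 32-bit integers; the function name simply mirrors the mnemonic.

```python
def add4(src1, src2):
    """Add four packed 8-bit lanes independently, without saturation."""
    out = 0
    for shift in range(0, 32, 8):
        a = (src1 >> shift) & 0xFF
        b = (src2 >> shift) & 0xFF
        out |= ((a + b) & 0xFF) << shift   # the carry out of each lane is dropped
    return out


print(f"{add4(0x01FF7F80, 0x01018080):08X}")   # 0200FF00
```

A plain 32-bit addition of the same operands gives 0300FF00h, because the carry out of the third lane would spill into the top lane.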

ADDKPC

Add a Signed 7-bit Constant to Program Counter: In the
ADDKPC instruction, a 7-bit signed constant is shifted 2
bits to the left, then added to the address of the first
instruction of the fetch packet that contains the ADDKPC
instruction (PCE1). The result is placed in the
destination register. The 3-bit unsigned constant defines
the number of parallel NOPs. This instruction helps reduce
the number of instructions needed to set up the return
address for a function call.

ANDN

Bit-Wise Logical AND Invert: In the ANDN instruction, a
bitwise logical AND is performed between src1 and the
bitwise logical inverse of src2. The result is placed in
dst.

AVG2

Average, Signed Packed 16-bit: The AVG2 instruction
performs an averaging operation on packed 16-bit data. For
each pair of signed 16-bit values found in src1 and src2,
AVG2 calculates the average of the two values and returns a
signed 16-bit quantity in the corresponding position in the
dst.

AVGU4

Average, Unsigned Packed 8-bit: The AVGU4 instruction
performs an averaging operation on packed 8-bit data. The
values in src1 and src2 are treated as unsigned packed
8-bit data and the results are written in unsigned packed
8-bit format. For each pair of unsigned packed 8-bit values
found in src1 and src2, AVGU4 calculates the average of the
two values and returns an unsigned 8-bit quantity in the
corresponding position in the dst.



BDEC 



Branch and Decrement: If the predication and decrement
register (dst) is positive (greater than or equal to 0),
the BDEC instruction performs a relative branch and
decrements dst by one.

The instruction performs the relative branch using a
10-bit signed constant specified by scst10. The constant
is shifted 2 bits to the left, then added to the address of
the first instruction of the fetch packet that contains the
BDEC instruction (PCE1). The result is placed in the
program fetch counter (PFC).

This instruction helps reduce the number of instructions
needed to decrement a register and branch conditionally
based upon that register's value. Note also that any
register can be used, which can free the predicate registers
(A0-A2 and B0-B2) for other uses.
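
The BDEC decision and its target calculation can be sketched in Python. This is a hedged model: the function names are hypothetical, dst is treated as a signed Python integer, delay slots are not modeled, and the 32-bit wrap of the target address is an assumption.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a signed quantity."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def bdec(dst, scst10, pce1):
    """Return (branch target or None, new value of dst)."""
    if dst >= 0:
        target = (pce1 + (sign_extend(scst10, 10) << 2)) & 0xFFFFFFFF
        return target, dst - 1      # branch taken, counter decremented
    return None, dst                # dst negative: no branch, no decrement


print(bdec(2, 0x3FF, 0x1000))   # (4092, 1): scst10 = -1, target = 0x1000 - 4
print(bdec(-1, 4, 0x1000))      # (None, -1)
```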



BITC4 



Bit Count, Packed 8-bit: The BITC4 instruction performs a
bit-count operation on 8-bit quantities. The value in src2
is treated as packed 8-bit data, and the result is written
in packed 8-bit format. For each of the 8-bit quantities in
src2, the count of the number of "1" bits in that value is
written to the corresponding position in dst.
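
A minimal Python sketch of the per-byte bit count is given
here for illustration. The function name bitc4 and the
modelling of a register as a 32-bit integer are assumptions
of the sketch.

```python
def bitc4(src2):
    """Count the set bits of each byte of src2 and pack the four counts."""
    dst = 0
    for i in range(4):
        byte = (src2 >> (8 * i)) & 0xFF
        dst |= bin(byte).count("1") << (8 * i)  # each count is 0..8, fits a byte
    return dst
```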
BITR

Bit Reverse: The BITR instruction implements a bit- 
reversal. A bit-reversal function reverses the order of 
bits in a 32-bit word. In other words, bit 0 of the source 
becomes bit 31 of the result, bit 1 of the source becomes 
bit 30 of the result, bit 2 becomes bit 29, and so on. 
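
The bit mapping just described can be illustrated with a
short Python sketch. The function name bitr is an assumption
made for the example.

```python
def bitr(src):
    """Reverse the order of the 32 bits of src (bit 0 <-> bit 31, and so on)."""
    dst = 0
    for i in range(32):
        if src & (1 << i):
            dst |= 1 << (31 - i)
    return dst
```

Applying the function twice returns the original word.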
BNOP 

Branch Using a Displacement with NOP: The constant
displacement form of the BNOP instruction performs a
relative branch with NOPs. The instruction performs the
relative branch using the 12-bit signed constant specified
by src2. The constant is shifted 2 bits to the left, then
added to the address of the first instruction of the fetch
packet that contains the BNOP instruction (PCE1). The
result is placed in the program fetch counter (PFC).
The 3-bit unsigned constant specified in src1 gives the
number of delay slot NOPs to be inserted, from 0 to 5. With
src1 = 0, no NOP cycles are inserted.

This instruction helps reduce the number of instructions
needed to perform a branch when NOPs are required to fill
the delay slots of a branch.

The register form of the BNOP instruction performs an
absolute branch with NOPs. The register specified in src2
is placed in the program fetch counter (PFC).
BPOS

Branch Positive: If the predication register (dst) is
positive (greater than or equal to 0), the BPOS instruction
performs a relative branch. If dst is negative, the BPOS
instruction takes no other action.

The instruction performs the relative branch using the
10-bit signed constant specified by scst10. The constant
is shifted 2 bits to the left, then added to the address of
the first instruction of the fetch packet that contains the
BPOS instruction (PCE1). The result is placed in the
program fetch counter (PFC).

Note that any register can be used, which frees the
predicate registers (A0-A2 and B0-B2) for other uses.
CMPEQ2

Compare if Equal, Packed 16-bit: The CMPEQ2 instruction
performs equality comparisons on packed 16-bit data. Each
16-bit value in src1 is compared against the corresponding
16-bit value in src2, returning a 1 if equal or 0 if not
equal. The equality results are packed into the two
least-significant bits of dst. The result for the lower
pair of values is placed in bit 0, and the result for the
upper pair of values is placed in bit 1. The remaining bits
of dst are set to 0.
CMPEQ4

Compare if Equal, Packed 8-bit: The CMPEQ4 instruction
performs equality comparisons on packed 8-bit data. Each
8-bit value in src1 is compared against the corresponding
8-bit value in src2, returning a 1 if equal or 0 if not
equal. The equality results are packed into the four
least-significant bits of dst.

The 8-bit values in each input are numbered from 0 to 3
starting with the least-significant byte, working towards
the most-significant byte. The comparison results for byte
0 are written to bit 0 of the result. Likewise, the results
for byte 1 to 3 are written to bits 1 to 3 of the result,
respectively, as shown in the diagram below. The remaining
bits of dst are set to 0.
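
The byte numbering and result packing can be illustrated
with a small Python sketch. The function name cmpeq4 and the
integer register model are assumptions of the example.

```python
def cmpeq4(src1, src2):
    """Compare the four bytes of src1 and src2; bit i of the result is 1
    when byte i (byte 0 = least significant) is equal in both inputs."""
    dst = 0
    for i in range(4):
        a = (src1 >> (8 * i)) & 0xFF
        b = (src2 >> (8 * i)) & 0xFF
        if a == b:
            dst |= 1 << i
    return dst  # bits 4..31 remain 0
```

With inputs 0x11223344 and 0x11AA33BB, bytes 1 and 3 match,
so only bits 1 and 3 of the result are set.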



CMPGT2 



Compare for Greater Than, Packed 16-bit: The CMPGT2
instruction performs greater than comparisons on signed
packed 16-bit data. Each signed 16-bit value in src1 is
compared against the corresponding signed 16-bit value in
src2, returning a 1 if src1 is greater than src2 or
returning a 0 if it is not greater. The comparison results
are packed into the two least-significant bits of dst. The
result for the lower pair of values is placed in bit 0, and
the result for the upper pair of values is placed in bit
1. The remaining bits of dst are set to 0.
CMPGTU4

Compare for Greater Than, Unsigned Packed 8-bit: The
CMPGTU4 instruction performs greater than comparisons on
packed 8-bit data. Each unsigned 8-bit value in src1 is
compared against the corresponding 8-bit unsigned value in
src2, returning a 1 if the byte in src1 is greater than the
corresponding byte in src2 or 0 if it is not greater. The
comparison results are packed into the four
least-significant bits of dst.

The 8-bit values in each input are numbered from 0 to 3
starting with the least-significant byte, working towards
the most-significant byte. The comparison results for byte
0 are written to bit 0 of the result. Likewise, the results
for byte 1 to 3 are written to bits 1 to 3 of the result,
respectively, as shown in the diagram below. The remaining
bits of dst are set to 0.

CMPLT2

Compare for Less Than, Packed 16-bit (Pseudo-Operation):
CMPLT2 is a pseudo-operation used to perform less than
comparisons on signed packed 16-bit data. Each signed
16-bit value in src2 is compared against the corresponding
signed 16-bit value in src1, returning a 1 if src2 is less
than src1 or returning a 0 if it is not less than. The
comparison results are packed into the two
least-significant bits of dst. The result for the lower
pair of values is placed in bit 0, and the result for the
upper pair of values is placed in bit 1. The remaining bits
of dst are set to 0. The assembler uses the operation
CMPGT2 (.unit) src1, src2, dst to perform this task.
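
The relationship between the pseudo-operation and CMPGT2 can
be sketched in Python. The function names, the helper s16,
and in particular the operand order of the hypothetical
cmplt2 wrapper (src2 first, then src1) are assumptions made
for illustration; the text only states that the assembler
emits CMPGT2 with src1 and src2.

```python
def s16(value):
    """Interpret the low 16 bits of value as a signed quantity."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def cmpgt2(src1, src2):
    """Bit i of the result is 1 when half-word i of src1 > half-word i of src2."""
    dst = 0
    for i in range(2):
        if s16(src1 >> (16 * i)) > s16(src2 >> (16 * i)):
            dst |= 1 << i
    return dst


def cmplt2(src2, src1):
    """'src2 less than src1' is the same test as 'src1 greater than src2'."""
    return cmpgt2(src1, src2)
```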
CMPLTU4

Compare for Less Than, Unsigned Packed 8-bit
(Pseudo-Operation): CMPLTU4 is a pseudo-operation that
performs less than comparisons on packed 8-bit data. Each
unsigned 8-bit value in src2 is compared against the
corresponding 8-bit unsigned value in src1, returning a 1
if the byte in src2 is less than the corresponding byte in
src1 or 0 if it is not less than. The comparison results
are packed into the four least-significant bits of dst.

The 8-bit values in each input are numbered from 0 to 3
starting with the least-significant byte, working towards
the most-significant byte. The comparison results for byte
0 are written to bit 0 of the result. Likewise, the results
for byte 1 to 3 are written to bits 1 to 3 of the result,
respectively, as shown in the diagram below. The remaining
bits of dst are set to 0.

The assembler uses the instruction CMPGTU4 (.unit)
src1, src2, dst to perform this task.

DEAL

De-interleave and Pack: The DEAL instruction performs a
de-interleave and pack operation on the bits in src2. The
odd and even bits of src2 are extracted into two separate
16-bit quantities. These 16-bit quantities are then packed
such that the even bits are placed in the lower half-word,
and the odd bits are placed in the upper half-word.

As a result, bits 0, 2, 4, ..., 28, 30 of src2 are
placed in bits 0, 1, 2, ..., 14, 15 of dst. Likewise, bits
1, 3, 5, ..., 29, 31 of src2 are placed in bits 16, 17, 18,
..., 30, 31 of dst.

The DEAL instruction is the exact inverse of the SHFL
instruction.
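
The de-interleave mapping can be illustrated with a short
Python sketch. The function name deal and the use of a plain
32-bit integer for the register are assumptions of the
example.

```python
def deal(src2):
    """Gather even-numbered bits into the lower half-word and
    odd-numbered bits into the upper half-word."""
    low = 0
    high = 0
    for i in range(16):
        low |= ((src2 >> (2 * i)) & 1) << i        # bits 0, 2, 4, ..., 30
        high |= ((src2 >> (2 * i + 1)) & 1) << i   # bits 1, 3, 5, ..., 31
    return (high << 16) | low
```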
DOTP2

Dot Product, Signed Packed 16-bit: The DOTP2 instruction
returns the dot-product between two pairs of signed packed
16-bit values. The values in src1 and src2 are treated as
signed packed 16-bit quantities. The signed result is
written either to a single 32-bit register, or
sign-extended into a 64-bit register pair.

The product of the lower half-words of src1 and src2 is
added to the product of the upper half-words of src1 and
src2. The result is then written to dst.

In the 64-bit result version, the upper word of the
register pair always contains either all 0's or all 1's,
depending on whether the result is positive or negative,
respectively.

The 32-bit result version returns the same results as
the 64-bit result version does in the lower 32 bits. The
upper 32 bits are discarded.
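
A hedged Python sketch of the two result widths follows. The
function names and the decision to return both the 32-bit
and the 64-bit form from one call are assumptions made for
illustration only.

```python
def s16(value):
    """Interpret the low 16 bits of value as a signed quantity."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def dotp2(src1, src2):
    """Return (dst32, dst64): the sum of the two half-word products as a
    32-bit value and as a sign-extended 64-bit value."""
    low_product = s16(src1) * s16(src2)
    high_product = s16(src1 >> 16) * s16(src2 >> 16)
    total = high_product + low_product
    return total & 0xFFFFFFFF, total & 0xFFFFFFFFFFFFFFFF
```

For a negative sum, the upper word of the 64-bit form is all
1's, and the lower word equals the 32-bit form.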
DOTPN2

Dot Product with Negate, Signed Packed 16-bit: The
DOTPN2 instruction returns the dot-product between two
pairs of signed packed 16-bit values where the second
product is negated. The values in src1 and src2 are treated
as signed packed 16-bit quantities. The signed result is
written to a single 32-bit register.

The product of the lower half-words of src1 and src2 is
subtracted from the product of the upper half-words of src1
and src2. The result is then written to dst.
DOTPNRSU2

Dot Product with Negate, Shift and Round, Signed by
Unsigned Packed 16-bit: The DOTPNRSU2 instruction returns
the dot-product between two pairs of packed 16-bit values,
where the second product is negated. This instruction takes
the result of the dot-product, and performs an additional
round and shift step. The values in src1 are treated as
signed packed 16-bit quantities, whereas the values in src2
are treated as unsigned packed 16-bit quantities. The
results are written to dst.

The product of the lower half-words of src1 and src2 is
subtracted from the product of the upper half-words of
src1 and src2. The value 2^15 is then added to this
difference, producing an intermediate 32-bit result. The
intermediate result is arithmetically shifted right by 16,
producing a rounded, shifted result that is sign extended
and placed in dst.



DOTPNRUS2

Dot Product with Negate, Shift and Round, Unsigned by
Signed Packed 16-bit: The DOTPNRUS2 pseudo-operation
performs the dot-product between two pairs of packed 16-bit
values, where the second product is negated. This
instruction takes the result of the dot-product, and
performs an additional round and shift step. The values in
src1 are treated as signed packed 16-bit quantities,
whereas the values in src2 are treated as unsigned packed
16-bit quantities. The results are written to dst. The
assembler uses the DOTPNRSU2 src1, src2, dst instruction
to perform this task.

The product of the lower half-words of src1 and src2 is
subtracted from the product of the upper half-words of
src1 and src2. The value 2^15 is then added to this
difference, producing an intermediate 32-bit result. The
intermediate result is arithmetically shifted right by 16,
producing a rounded, shifted result that is sign extended
and placed in dst.
DOTPRSU2

Dot Product with Shift and Round, Signed by Unsigned
Packed 16-bit: The DOTPRSU2 instruction returns the
dot-product between two pairs of packed 16-bit values. This
instruction takes the result of the dot-product, and
performs an additional round and shift step. The values in
src1 are treated as signed packed 16-bit quantities,
whereas the values in src2 are treated as unsigned packed
16-bit quantities. The results are written to dst.

The product of the lower half-words of src1 and src2 is
added to the product of the upper half-words of src1 and
src2. The value 2^15 is then added to this sum, producing an
intermediate 32-bit result. The intermediate result is
arithmetically shifted right by 16, producing a rounded,
shifted result that is sign extended and placed in dst.
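
The round-and-shift step can be sketched in Python as
follows. This is an illustrative model only: the function
names are hypothetical, the rounding constant is taken to be
2^15 (half of the 2^16 implied by the 16-bit right shift),
and the intermediate sum is kept at full precision rather
than wrapped to 32 bits.

```python
def s16(value):
    """Interpret the low 16 bits of value as a signed quantity."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def dotprsu2(src1, src2):
    """Signed (src1) by unsigned (src2) dot product with round and shift."""
    low_product = s16(src1) * (src2 & 0xFFFF)
    high_product = s16(src1 >> 16) * ((src2 >> 16) & 0xFFFF)
    # Assumed rounding constant: 2^15, then an arithmetic shift right by 16.
    intermediate = high_product + low_product + (1 << 15)
    return (intermediate >> 16) & 0xFFFFFFFF
```

Adding half of the divisor before shifting rounds the result
to the nearest integer instead of truncating it.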

DOTPRUS2

Dot Product with Shift and Round, Unsigned by Signed
Packed 16-bit (Pseudo-Operation): The DOTPRUS2
pseudo-operation returns the dot-product between two pairs
of packed 16-bit values. This instruction takes the result
of the dot-product, and performs an additional round and
shift step. The values in src1 are treated as signed packed
16-bit quantities, whereas the values in src2 are treated
as unsigned packed 16-bit quantities. The results are
written to dst. The assembler uses the DOTPRSU2 src1, src2,
dst instruction to perform this task.

The product of the lower half-words of src1 and src2 is
added to the product of the upper half-words of src1 and
src2. The value 2^15 is then added to this sum, producing an
intermediate 32-bit result. The intermediate result is
arithmetically shifted right by 16, producing a rounded,
shifted result that is sign extended and placed in dst.
DOTPSU4

Dot Product, Signed by Unsigned Packed 8-bit: The
DOTPSU4 instruction returns the dot-product between four
sets of packed 8-bit values. The values in src1 are treated
as signed packed 8-bit quantities, whereas the values in
src2 are treated as unsigned 8-bit packed data. The signed
result is written into dst.

For each pair of 8-bit quantities in src1 and src2, the
signed 8-bit value from src1 is multiplied with the
unsigned 8-bit value from src2. The four products are
summed together, and the resulting dot product is written
as a signed 32-bit result to dst.



DOTPUS4

Dot Product, Unsigned by Signed Packed 8-bit
(Pseudo-Operation): The DOTPUS4 pseudo-operation returns
the dot-product between four sets of packed 8-bit values.
The values in src1 are treated as signed packed 8-bit
quantities, whereas the values in src2 are treated as
unsigned 8-bit packed data. The signed result is written
into dst. The assembler uses the DOTPSU4 src1, src2, dst
instruction to perform this task.

For each pair of 8-bit quantities in src1 and src2, the
signed 8-bit value from src1 is multiplied with the
unsigned 8-bit value from src2. The four products are
summed together, and the resulting dot product is written
as a signed 32-bit result to dst.
DOTPU4

Dot Product, Unsigned Packed 8-bit: The DOTPU4
instruction returns the dot-product between four sets of
packed 8-bit values. The values in both src1 and src2 are
treated as unsigned 8-bit packed data. The unsigned result
is written into dst.

For each pair of 8-bit quantities in src1 and src2, the
unsigned 8-bit value from src1 is multiplied with the
unsigned 8-bit value from src2. The four products are
summed together, and the resulting dot product is written
as a 32-bit result to dst.

GMPY4

Galois Field Multiply, Packed 8-bit: The GMPY4
instruction performs the Galois field multiply on four
values in src1 with four parallel values in src2. The four
products are packed into dst. The values in both src1 and
src2 are treated as unsigned 8-bit packed data.

For each pair of 8-bit quantities in src1 and src2, the
unsigned 8-bit value from src1 is Galois field multiplied
with the unsigned 8-bit value from src2. The product of
src1 byte 0 and src2 byte 0 is written to byte 0 of dst. The
product of src1 byte 1 and src2 byte 1 is written to byte 1
of dst. The product of src1 byte 2 and src2 byte 2 is
written to byte 2 of dst. The product of src1 byte 3 and
src2 byte 3 is written to the most significant byte in dst.
LDNDW 

Load Non-Aligned Double Word: The LDNDW instruction
loads a 64-bit quantity from memory into a register pair,
dst_o:dst_e. The LDNDW instruction may read a 64-bit value
from any byte boundary. Thus alignment to a 64-bit boundary
is not required. The effective memory address is formed
from a base address register (baseR) and an optional offset
that is either a register (offsetR) or a 5-bit unsigned
constant (ucst5).
LDNW

Load Non-Aligned Word: The LDNW instruction loads a
32-bit quantity from memory into a 32-bit register, dst. The
LDNW instruction may read a 32-bit value from any byte
boundary. Thus alignment to a 32-bit boundary is not
required. The effective memory address is formed from a
base address register (baseR) and an optional offset that
is either a register (offsetR) or a 5-bit unsigned constant
(ucst5). If an offset is not given, the assembler assigns
an offset of zero.

MAX2

Maximum, Signed Packed 16-bit: The MAX2 instruction
performs a maximum operation on packed signed 16-bit
values. For each pair of signed 16-bit values in src1 and
src2, MAX2 places the larger value in the corresponding
position in dst.
MAXU4

Maximum, Unsigned Packed 8-bit: The MAXU4 instruction
performs a maximum operation on packed unsigned 8-bit
values. For each pair of unsigned 8-bit values in src1 and
src2, MAXU4 places the larger value in the corresponding
position in dst.
MIN2

Minimum, Signed Packed 16-bit: The MIN2 instruction
performs a minimum operation on packed signed 16-bit
values. For each pair of signed 16-bit values in src1 and
src2, MIN2 places the smaller value in the corresponding
position in dst.
MINU4

Minimum, Unsigned Packed 8-bit: The MINU4 instruction
performs a minimum operation on packed unsigned 8-bit
values. For each pair of unsigned 8-bit values in src1 and
src2, MINU4 places the smaller value in the corresponding
position in dst.
MPY2

Multiply Signed by Signed, Packed 16-bit: The MPY2
instruction performs two 16-bit by 16-bit multiplications
between two pairs of signed packed 16-bit values. The
values in src1 and src2 are treated as signed packed 16-bit
quantities. The two 32-bit results are written into a
64-bit register pair.

The product of the lower half-words of src1 and src2 is
written to the even destination register, dst_e. The
product of the upper half-words of src1 and src2 is written
to the odd destination register, dst_o.

MPYHI

Multiply 16 MSB x 32-bit into 64-bit Result: The MPYHI
instruction performs a 16-bit by 32-bit multiply. The upper
half of src1 is used as a 16-bit signed input. The value
in src2 is treated as a 32-bit signed value. The result is
written into the lower 48 bits of a 64-bit register pair,
dst_o:dst_e, and sign extended to 64 bits.

MPYHIR

Multiply 16 MSB x 32-bit, Shifted by 15 to Produce a
Rounded 32-bit Result: The MPYHIR instruction performs a
16-bit by 32-bit multiply. The upper half of src1 is used
as a 16-bit signed input. The value in src2 is treated as
a 32-bit signed value. The product is then rounded to a
32-bit result by adding the value 2^14 and then this sum is
right shifted by 15. The lower 32 bits of the result are
written into dst.
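
The multiply, round and shift sequence can be sketched in
Python. The function and helper names are hypothetical, and
the rounding constant is taken to be 2^14 (half of the 2^15
implied by the 15-bit right shift); both are assumptions of
this sketch.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mpyhir(src1, src2):
    """Signed upper half of src1 times signed 32-bit src2, rounded, >> 15."""
    product = sign_extend(src1 >> 16, 16) * sign_extend(src2, 32)
    # Assumed rounding constant: 2^14, then an arithmetic shift right by 15.
    return ((product + (1 << 14)) >> 15) & 0xFFFFFFFF
```

If the upper half of src1 is read as a Q15 fraction, 0x4000
represents 0.5, so multiplying by 1000 yields 500 and
multiplying by 3 yields 1.5, which rounds to 2.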
MPYIH

Multiply 32-bit x 16 MSB into 64-bit Result
(Pseudo-Operation): The MPYIH pseudo-operation performs a
16-bit by 32-bit multiply. The upper half of src1 is used
as a 16-bit signed input. The value in src2 is treated as a
32-bit signed value. The result is written into the lower
48 bits of a 64-bit register pair, dst_o:dst_e, and sign
extended to 64 bits. The assembler uses the MPYHI src1,
src2, dst instruction to perform this task.
MPYIHR

Multiply 32-bit x 16 MSB, Shifted by 15 to Produce a
Rounded 32-bit Result (Pseudo-Operation): The MPYIHR
pseudo-operation performs a 16-bit by 32-bit multiply. The
upper half of src1 is used as a 16-bit signed input. The
value in src2 is treated as a 32-bit signed value. The
product is then rounded to a 32-bit result by adding the
value 2^14 and then this sum is right shifted by 15. The
lower 32 bits of the result are written into dst. The
assembler uses the MPYHIR src1, src2, dst instruction to
perform this operation.

MPYIL

Multiply 32-bit x 16 LSB into 64-bit Result
(Pseudo-Operation): The MPYIL pseudo-operation performs a
16-bit by 32-bit multiply. The lower half of src1 is used
as a 16-bit signed input. The value in src2 is treated as a
32-bit signed value. The result is written into the lower
48 bits of a 64-bit register pair, dst_o:dst_e, and sign
extended to 64 bits. The assembler uses the MPYLI src1,
src2, dst instruction to perform this operation.
MPYILR

Multiply 32-bit x 16 LSB, Shifted by 15 to Produce a
Rounded 32-bit Result (Pseudo-Operation): The MPYILR
pseudo-operation performs a 16-bit by 32-bit multiply. The
lower half of src1 is used as a 16-bit signed input. The
value in src2 is treated as a 32-bit signed value. The
product is then rounded to a 32-bit result by adding the
value 2^14 and then this sum is right shifted by 15. The
lower 32 bits of the result are written into dst. The
assembler uses the MPYLIR src1, src2, dst instruction to
perform this operation.

MPYLI

Multiply 16 LSB x 32-bit into 64-bit Result: The MPYLI
instruction performs a 16-bit by 32-bit multiply. The lower
half of src1 is used as a 16-bit signed input. The value
in src2 is treated as a 32-bit signed value. The result is
written into the lower 48 bits of a 64-bit register pair,
dst_o:dst_e, and sign extended to 64 bits.
MPYLIR

Multiply 16 LSB x 32-bit, Shifted by 15 to Produce a
Rounded 32-bit Result: The MPYLIR instruction performs a
16-bit by 32-bit multiply. The lower half of src1 is used
as a 16-bit signed input. The value in src2 is treated as
a 32-bit signed value. The product is then rounded to a
32-bit result by adding the value 2^14 and then this sum is
right shifted by 15. The lower 32 bits of the result are
written into dst.
MPYSU4

Multiply Signed by Unsigned Packed, 8-bit: The MPYSU4
instruction returns the product between four sets of packed
8-bit values producing four signed 16-bit results. The four
signed 16-bit results are packed into a 64-bit register
pair, dst_o:dst_e. The values in src1 are treated as
signed packed 8-bit quantities, whereas the values in src2
are treated as unsigned 8-bit packed data.

For each pair of 8-bit quantities in src1 and src2, the
signed 8-bit value from src1 is multiplied with the
unsigned 8-bit value from src2. The product of src1 byte 0
and src2 byte 0 is written to the lower half of dst_e. The
product of src1 byte 1 and src2 byte 1 is written to the
upper half of dst_e. The product of src1 byte 2 and src2
byte 2 is written to the lower half of dst_o. The product
of src1 byte 3 and src2 byte 3 is written to the upper half
of dst_o.
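
The placement of the four products in the register pair can
be illustrated with a Python sketch. The function names and
the representation of the register pair as a (dst_o, dst_e)
tuple are assumptions of the example.

```python
def s8(value):
    """Interpret the low 8 bits of value as a signed quantity."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def mpysu4(src1, src2):
    """Return (dst_o, dst_e) holding four signed 16-bit products."""
    products = []
    for i in range(4):
        a = s8(src1 >> (8 * i))          # signed byte i of src1
        b = (src2 >> (8 * i)) & 0xFF     # unsigned byte i of src2
        products.append((a * b) & 0xFFFF)
    dst_e = (products[1] << 16) | products[0]   # bytes 1 and 0
    dst_o = (products[3] << 16) | products[2]   # bytes 3 and 2
    return dst_o, dst_e
```

The largest magnitude product, -128 x 255 = -32640, still
fits in a signed 16-bit field, so no saturation is needed.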
MPYUS4

Multiply Unsigned by Signed Packed, 8-bit
(Pseudo-Operation): The MPYUS4 pseudo-operation returns the
product between four sets of packed 8-bit values producing
four signed 16-bit results. The four signed 16-bit results
are packed into a 64-bit register pair, dst_o:dst_e. The
values in src1 are treated as signed packed 8-bit
quantities, whereas the values in src2 are treated as
unsigned 8-bit packed data. The assembler uses the MPYSU4
src1, src2, dst instruction to perform this operation.

For each pair of 8-bit quantities in src1 and src2, the
signed 8-bit value from src1 is multiplied with the
unsigned 8-bit value from src2. The product of src1 byte 0
and src2 byte 0 is written to the lower half of dst_e. The
product of src1 byte 1 and src2 byte 1 is written to the
upper half of dst_e. The product of src1 byte 2 and src2
byte 2 is written to the lower half of dst_o. The product
of src1 byte 3 and src2 byte 3 is written to the upper half
of dst_o.
MPYU4

Multiply Unsigned by Unsigned Packed, 8-bit: The MPYU4
instruction returns the product between four sets of packed
8-bit values producing four unsigned 16-bit results that
are packed into a 64-bit register pair, dst_o:dst_e. The
values in both src1 and src2 are treated as unsigned 8-bit
packed data.

For each pair of 8-bit quantities in src1 and src2, the
unsigned 8-bit value from src1 is multiplied with the
unsigned 8-bit value from src2. The product of src1 byte 0
and src2 byte 0 is written to the lower half of dst_e. The
product of src1 byte 1 and src2 byte 1 is written to the
upper half of dst_e. The product of src1 byte 2 and src2
byte 2 is written to the lower half of dst_o. The product
of src1 byte 3 and src2 byte 3 is written to the upper half
of dst_o.
MVD

Move from Register to Register, Delayed: The MVD
instruction moves data from the src2 register to the dst
register over 4 cycles. This is done using the multiplier
path.
PACK2 

Pack 16lsb, 16lsb into Packed 16-bit: The PACK2
instruction takes the lower half-words from src1 and src2
and packs them both into dst. The lower half-word of src1
is placed in the upper half-word of dst. The lower
half-word of src2 is placed in the lower half-word of dst.

This instruction is useful for manipulating and
preparing pairs of 16-bit values to be used by the packed
arithmetic operations, such as ADD2.

PACKH2

Pack 16msb, 16msb into Packed 16-bit: The PACKH2
instruction takes the upper half-words from src1 and src2
and packs them both into dst. The upper half-word of src1
is placed in the upper half-word of dst. The upper
half-word of src2 is placed in the lower half-word of dst.

This instruction is useful for manipulating and preparing
pairs of 16-bit values to be used by the packed arithmetic
operations, such as ADD2.






PACKH4

Pack High Bytes of Four Half-words into Packed 8-bit:
The PACKH4 instruction moves the high bytes of the two
half-words in src1 and src2 and packs them into dst. The
bytes from src1 will be packed into the most significant
bytes of dst, and the bytes from src2 will be packed into
the least significant bytes of dst.

Specifically, the high byte of the upper half-word of
src1 is moved to the upper byte of the upper half-word of
dst. The high byte of the lower half-word of src1 is moved
to the lower byte of the upper half-word of dst. The high
byte of the upper half-word of src2 is moved to the upper
byte of the lower half-word of dst. The high byte of the
lower half-word of src2 is moved to the lower byte of the
lower half-word of dst.

PACKHL2

Pack 16msb, 16lsb into Packed 16-bit: The PACKHL2
instruction takes the upper half-word from src1 and the
lower half-word from src2 and packs them both into dst. The
upper half-word of src1 is placed in the upper half-word of
dst. The lower half-word of src2 is placed in the lower
half-word of dst.

This instruction is useful for manipulating and
preparing pairs of 16-bit values to be used by the packed
arithmetic operations, such as ADD2.
PACKL4

Pack Low Bytes of Four Half-words into Packed 8-bit: The
PACKL4 instruction moves the low bytes of the two
half-words in src1 and src2 and packs them into dst. The
bytes from src1 will be packed into the most significant
bytes of dst, and the bytes from src2 will be packed into
the least significant bytes of dst.

Specifically, the low byte of the upper half-word of
src1 is moved to the upper byte of the upper half-word of
dst. The low byte of the lower half-word of src1 is moved
to the lower byte of the upper half-word of dst. The low
byte of the upper half-word of src2 is moved to the upper
byte of the lower half-word of dst. The low byte of the
lower half-word of src2 is moved to the lower byte of the
lower half-word of dst.

PACKLH2

Pack 16lsb, 16msb into Packed 16-bit: The PACKLH2
instruction takes the lower half-word from src1 and the
upper half-word from src2 and packs them both into dst. The
lower half-word of src1 is placed in the upper half-word of
dst. The upper half-word of src2 is placed in the lower
half-word of dst.

This instruction is useful for manipulating and
preparing pairs of 16-bit values to be used by the packed
arithmetic operations, such as ADD2.
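
The four half-word pack variants (PACK2, PACKH2, PACKHL2 and
PACKLH2) differ only in which half of each source is
selected. A compact Python sketch, with function names and a
32-bit integer register model chosen purely for
illustration, makes the pattern explicit.

```python
def low_half(value):
    """Lower 16 bits of a 32-bit value."""
    return value & 0xFFFF


def high_half(value):
    """Upper 16 bits of a 32-bit value."""
    return (value >> 16) & 0xFFFF


def pack2(src1, src2):
    return (low_half(src1) << 16) | low_half(src2)


def packh2(src1, src2):
    return (high_half(src1) << 16) | high_half(src2)


def packhl2(src1, src2):
    return (high_half(src1) << 16) | low_half(src2)


def packlh2(src1, src2):
    return (low_half(src1) << 16) | high_half(src2)
```

In every variant the half-word taken from src1 lands in the
upper half of dst and the half-word taken from src2 lands in
the lower half.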
ROTL 

Rotate Left: The ROTL instruction rotates the 32-bit
value of src2 to the left and places the result in dst. The
number of bits to rotate is given in the five
least-significant bits of src1. Bits 5 through 31 of src1
are ignored and may be non-zero.
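
A minimal Python sketch of the rotate follows. The function
name rotl and its argument order (value first, then rotate
amount) are assumptions made for the example.

```python
def rotl(src2, src1):
    """Rotate the 32-bit value src2 left by the low five bits of src1."""
    value = src2 & 0xFFFFFFFF
    amount = src1 & 0x1F  # bits 5..31 of src1 are ignored
    return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF
```

Because only five bits of the amount are used, a rotate by
36 behaves the same as a rotate by 4.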
SADD2

Add with Saturation, Signed Packed 16-bit: The SADD2
instruction performs 2s-complement addition between signed
packed 16-bit quantities in src1 and src2. The results are
placed in a signed packed 16-bit format into dst.

For each pair of 16-bit quantities in src1 and src2, the
sum between the signed 16-bit value from src1 and the
signed 16-bit value from src2 is calculated and saturated
to produce a signed 16-bit result. The result is placed in
the corresponding position in dst.
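
Saturation to the signed 16-bit range can be sketched in
Python as follows. The function names are hypothetical, and
the saturation limits used are the usual signed 16-bit
bounds of -32768 and 32767.

```python
def s16(value):
    """Interpret the low 16 bits of value as a signed quantity."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def sadd2(src1, src2):
    """Add the two signed half-word pairs, clamping each sum to 16 bits."""
    dst = 0
    for i in range(2):
        total = s16(src1 >> (16 * i)) + s16(src2 >> (16 * i))
        total = max(-0x8000, min(0x7FFF, total))  # saturate
        dst |= (total & 0xFFFF) << (16 * i)
    return dst
```

Each half-word is clamped independently, so an overflow in
one lane does not disturb the other.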
SADDSU2

Add with Saturation, Signed with Unsigned Packed 16-bit
(Pseudo-Operation): The SADDSU2 pseudo-operation performs
2s-complement addition between unsigned and signed packed
16-bit quantities. The values in src1 are treated as
unsigned packed 16-bit quantities and the values in src2
are treated as signed packed 16-bit quantities. The results
are placed in an unsigned packed 16-bit format into dst.
The assembler uses the SADDUS2 src1, src2, dst instruction
to perform this operation.

For each pair of 16-bit quantities in src1 and src2, the
sum between the unsigned 16-bit value from src1 and the
signed 16-bit value from src2 is calculated and saturated
to produce a signed 16-bit result. The result is placed in
the corresponding position in dst.
SADDU4

Add with Saturation, Unsigned, Packed 8-bit: The SADDU4
instruction performs 2s-complement addition between
unsigned packed 8-bit quantities. The values in src1 and
src2 are treated as unsigned packed 8-bit quantities. The
results are written into dst in an unsigned packed 8-bit
format.

For each pair of 8-bit quantities in src1 and src2, the
sum between the unsigned 8-bit value from src1 and the
unsigned 8-bit value from src2 is calculated and saturated
to produce an unsigned 8-bit result. The result is placed
in the corresponding position in dst.

This operation is performed on each 8-bit quantity
separately. This instruction does not affect the SAT bit in
the CSR.

SADDUS2

Add with Saturation, Unsigned with Signed Packed 16-bit:
The SADDUS2 instruction performs 2s-complement addition
between unsigned and signed packed 16-bit quantities. The
values in src1 are treated as unsigned packed 16-bit
quantities and the values in src2 are treated as signed
packed 16-bit quantities. The results are placed in an
unsigned packed 16-bit format into dst.

For each pair of 16-bit quantities in src1 and src2, the
sum between the unsigned 16-bit value from src1 and the
signed 16-bit value from src2 is calculated and saturated
to produce a signed 16-bit result. The result is placed in
the corresponding position in dst.
SHFL

Shuffle: The SHFL instruction performs an interleave
operation on the two half-words in src2. The bits in the
lower half-word of src2 are placed in the even bit
positions in dst, and the bits in the upper half-word of
src2 are placed in the odd bit positions in dst.

As a result, bits 0, 1, 2, ..., 14, 15 of src2 are
placed in bits 0, 2, 4, ..., 28, 30 of dst. Likewise, bits
16, 17, 18, ..., 30, 31 of src2 are placed in bits 1, 3, 5,
..., 29, 31 of dst.

The SHFL instruction is the exact inverse of the DEAL
instruction.
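
The interleave mapping can be illustrated with a short
Python sketch. The function name shfl and the 32-bit integer
register model are assumptions of the example.

```python
def shfl(src2):
    """Spread the lower half-word over the even bit positions and the
    upper half-word over the odd bit positions."""
    dst = 0
    for i in range(16):
        dst |= ((src2 >> i) & 1) << (2 * i)             # bit i      -> bit 2i
        dst |= ((src2 >> (16 + i)) & 1) << (2 * i + 1)  # bit 16 + i -> bit 2i + 1
    return dst
```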
SHLMB 

Shift Left and Merge Byte: The SHLMB instruction shifts
the contents of src2 left by one byte, and then the most
significant byte of src1 is merged into the least
significant byte position. The result is then placed in
dst.
SHR2 

Shift Right, Signed Packed 16-bit: The SHR2 instruction
performs an arithmetic shift right on signed packed 16-bit
quantities. The values in src2 are treated as signed packed
16-bit quantities. The lower five bits of src1 or ucst5
are treated as the shift amount. The results are placed in
a signed packed 16-bit format into dst.

For each signed 16-bit quantity in src2, the quantity is
shifted right by the number of bits specified in the lower
five bits of src1 or ucst5. Bits 5 through 31 of src1 are
ignored and may be non-zero. The shifted quantity is sign-
extended, and placed in the corresponding position in dst.
Bits shifted out of the least-significant bit of the signed
16-bit quantity are discarded.
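
A minimal C sketch of the SHR2 behavior follows. The names shr2_model and asr_lane16 are hypothetical, and the sign-filling shift is written so that it does not rely on implementation-defined right shifts of negative values.

```c
#include <stdint.h>

/* Arithmetic right shift of one signed 16-bit lane by n (0..31). */
static uint32_t asr_lane16(uint32_t lane, unsigned n)
{
    /* sign-extend the lane to 32 bits */
    int32_t v = (int32_t)((lane & 0xFFFFu) ^ 0x8000u) - 0x8000;
    /* portable sign-filling shift */
    int32_t r = (v < 0) ? ~(~v >> n) : (v >> n);
    return (uint32_t)r & 0xFFFFu;
}

/* Model of SHR2: only the lower five bits of the shift operand are used. */
uint32_t shr2_model(uint32_t src1_or_ucst5, uint32_t src2)
{
    unsigned n = src1_or_ucst5 & 31u;
    return (asr_lane16(src2 >> 16, n) << 16) | asr_lane16(src2, n);
}
```

Shift amounts of 16 or more leave each lane filled with copies of its sign bit, which is consistent with the sign extension described above.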
SHRMB

Shift Right and Merge Byte: The SHRMB instruction shifts
the contents of src2 right by one byte, and then the least
significant byte of src1 is merged into the most
significant byte position. The result is then placed in
dst.
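
The two merge-byte shifts (SHRMB here and SHLMB earlier in this list) reduce to a shift and an OR. The following C sketch uses the hypothetical names shlmb_model and shrmb_model.

```c
#include <stdint.h>

/* SHLMB: shift src2 left one byte, merge the most significant
   byte of src1 into the least significant byte position. */
uint32_t shlmb_model(uint32_t src1, uint32_t src2)
{
    return (src2 << 8) | (src1 >> 24);
}

/* SHRMB: shift src2 right one byte, merge the least significant
   byte of src1 into the most significant byte position. */
uint32_t shrmb_model(uint32_t src1, uint32_t src2)
{
    return (src2 >> 8) | (src1 << 24);
}
```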
SHRU2 

Shift Right, Unsigned Packed 16-bit: The SHRU2
instruction performs a logical shift right on unsigned
packed 16-bit quantities. The values in src2 are treated as
unsigned packed 16-bit quantities. The lower five bits of
src1 or ucst5 are treated as the shift amount. The results
are placed in an unsigned packed 16-bit format into dst.

For each unsigned 16-bit quantity in src2, the quantity
is shifted right by the number of bits specified in the
lower five bits of src1 or ucst5. Bits 5 through 31 of src1
are ignored and may be non-zero. The shifted quantity is
zero-extended, and placed in the corresponding position in
dst. Bits shifted out of the least-significant bit of the
unsigned 16-bit quantity are discarded.
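
For comparison with SHR2, a zero-filling lane shift can be sketched as follows. The name shru2_model is hypothetical.

```c
#include <stdint.h>

/* Model of SHRU2: each unsigned 16-bit lane is shifted right with
   zero fill; only the lower five bits of the shift operand are used. */
uint32_t shru2_model(uint32_t src1_or_ucst5, uint32_t src2)
{
    unsigned n  = src1_or_ucst5 & 31u;
    uint32_t hi = (src2 >> 16) >> n;
    uint32_t lo = (src2 & 0xFFFFu) >> n;
    return (hi << 16) | lo;
}
```

A shift amount of 16 or more clears both lanes.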
SMPY2

Multiply Signed by Signed, with Left Shift and Saturate,
Packed 16-bit: The SMPY2 instruction performs two 16-bit by
16-bit multiplications between two pairs of signed packed
16-bit values, with an additional left-shift and saturate.
The values in src1 and src2 are treated as signed packed
16-bit quantities. The two 32-bit results are written into
a 64-bit register pair.

The SMPY2 instruction produces two 16 x 16 products.
Each product is shifted left by one, and if the left-
shifted result is equal to 0x80000000, the output value is
saturated to 0x7FFFFFFF.

The saturated product of the lower half-words of src1
and src2 is written to the even destination register,
dst_e. The saturated product of the upper half-words of
src1 and src2 is written to the odd destination register,
dst_o.

Note: If either product saturates, the SAT bit is set in
the CSR on the cycle the result is written. If neither
product saturates, the SAT bit in the CSR is not affected.

This instruction helps reduce the number of instructions
required to perform two 16-bit by 16-bit saturated
multiplies on both the lower and upper halves of two
registers.
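
The shift-and-saturate rule can be sketched in C as below. The names smpy2_model, smpy_lane and sext16 are hypothetical, and the sat flag is only a stand-in for the SAT bit in the CSR. The only input pair that saturates is -32768 times -32768, whose doubled product equals 0x80000000.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of x. */
static int32_t sext16(uint32_t x)
{
    return (int32_t)((x & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* One lane: multiply, shift left by one, saturate to 0x7FFFFFFF. */
static int32_t smpy_lane(int32_t a, int32_t b, int *sat)
{
    int64_t p = (int64_t)a * b * 2;
    if (p > INT32_MAX) {
        *sat = 1;            /* sticky flag, never cleared here */
        return INT32_MAX;
    }
    return (int32_t)p;
}

/* Model of SMPY2: lower half-words feed dst_e, upper feed dst_o. */
void smpy2_model(uint32_t src1, uint32_t src2,
                 int32_t *dst_e, int32_t *dst_o, int *sat)
{
    *dst_e = smpy_lane(sext16(src1), sext16(src2), sat);
    *dst_o = smpy_lane(sext16(src1 >> 16), sext16(src2 >> 16), sat);
}
```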
SPACK2

Saturate and Pack into Signed Packed 16-bit: The SPACK2
instruction takes two signed 32-bit quantities in src1 and
src2 and saturates them to signed 16-bit quantities. The
signed 16-bit results are then packed into a signed packed
16-bit format and written to dst. Specifically, the
saturated 16-bit signed value of src1 is written to the
upper half word of dst and the saturated 16-bit signed
value of src2 is written to the lower half word of dst.

Saturation is performed on each input value 
independently. The input values start as signed 32-bit 
quantities, and are saturated to 16-bit quantities 
according to the following rules: 

If the value is in the range -2^15 to 2^15 - 1, inclusive,
then no saturation is performed and the value is merely
truncated to 16 bits.

If the value is greater than 2^15 - 1, then the result is
set to 2^15 - 1.

If the value is less than -2^15, then the result is set
to -2^15.

This instruction is useful in code which manipulates 16-bit
data at 32-bit precision for its intermediate steps,
but which requires the final results to be in a 16-bit
representation. The saturate step ensures that any values 
outside the signed 16-bit range are clamped to the high or 
low end of the range before being truncated to 16 bits. 

NOTE: This operation is performed on each 16-bit value 
separately. This instruction does not affect the SAT bit in 
the CSR. 
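
The three saturation rules for SPACK2 map directly onto a clamp followed by a pack. The C sketch below uses the hypothetical names spack2_model and sat_s16.

```c
#include <stdint.h>

/* Clamp a signed 32-bit value to the signed 16-bit range and
   return its 16-bit two's-complement encoding. */
static uint32_t sat_s16(int32_t v)
{
    if (v > 32767)  v = 32767;
    if (v < -32768) v = -32768;
    return (uint32_t)v & 0xFFFFu;
}

/* Model of SPACK2: src1 goes to the upper half word of dst,
   src2 to the lower half word. */
uint32_t spack2_model(int32_t src1, int32_t src2)
{
    return (sat_s16(src1) << 16) | sat_s16(src2);
}
```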
SPACKU4

Saturate and Pack into Unsigned Packed 8-bit: The
SPACKU4 instruction takes four signed 16-bit values and
saturates them to unsigned 8-bit quantities. The values in
src1 and src2 are treated as signed packed 16-bit
quantities. The results are written into dst in an unsigned
packed 8-bit format.

Each signed 16-bit quantity in src1 and src2 is
saturated to an unsigned 8-bit quantity as described below.
The resulting quantities are then packed into an unsigned
packed 8-bit format. Specifically, the upper half word of
src1 is used to produce the most significant byte of dst.
The lower half of src1 is used to produce the second most
significant byte (bits 16 to 23) of dst. The upper half
word of src2 is used to produce the third most significant
byte (bits 8 to 15) of dst. The lower half word of src2 is
used to produce the least significant byte of dst.

Saturation is performed on each signed 16-bit input
independently, producing separate unsigned 8-bit results.
For each value, the following tests are applied:

If the value is in the range 0 to 2^8 - 1, inclusive,
then no saturation is performed and the result is truncated
to 8 bits.

If the value is greater than 2^8 - 1, then the result is
set to 2^8 - 1.

If the value is less than 0, the result is set to 0.

This instruction is useful in code which manipulates 8-bit
data at 16-bit precision for its intermediate steps,
but which requires the final results to be in an 8-bit
representation. The saturate step ensures that any values
outside the unsigned 8-bit range are clamped to the high or
low end of the range before being truncated to 8 bits.

NOTE: This operation is performed on each 8-bit quantity
separately. This instruction does not affect the SAT bit in
the CSR.
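
A C sketch of the SPACKU4 clamp and byte placement follows. The names spacku4_model, sat_u8 and sext16 are hypothetical.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of x. */
static int32_t sext16(uint32_t x)
{
    return (int32_t)((x & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Clamp a signed value to the unsigned 8-bit range 0..255. */
static uint32_t sat_u8(int32_t v)
{
    if (v < 0)   return 0u;
    if (v > 255) return 255u;
    return (uint32_t)v;
}

/* Model of SPACKU4: src1 supplies the two upper bytes of dst,
   src2 supplies the two lower bytes. */
uint32_t spacku4_model(uint32_t src1, uint32_t src2)
{
    uint32_t b3 = sat_u8(sext16(src1 >> 16));
    uint32_t b2 = sat_u8(sext16(src1));
    uint32_t b1 = sat_u8(sext16(src2 >> 16));
    uint32_t b0 = sat_u8(sext16(src2));
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
}
```

For example, the four inputs 256, -1, 127 and 255 pack to the bytes 0xFF, 0x00, 0x7F and 0xFF.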
SSHVL

Variable Shift Left, Signed: The SSHVL instruction
shifts the signed 32-bit value in src2 to the left or right
by the number of bits specified by src1 and places the
result in dst.

The src1 argument is treated as a 2s-complement shift
value. Shift amounts that are larger than 31 are clamped
to a shift amount of 31. Shift amounts that are smaller
than -31 are clamped to a shift amount of -31. If src1 is
positive, src2 is shifted to the left. If src1 is negative,
src2 is shifted to the right by the absolute value of the
shift amount with the sign-extended shifted value being
placed in dst. It should also be noted that when src1 is
negative, the bits shifted to the right past bit 0 are
lost.

NOTE: If the shifted value is saturated, then the SAT
bit is set in CSR in the same cycle that the result is
written. If the shifted value is not saturated, then the
SAT bit is unaffected.
SSHVR 

Variable Shift Right, Signed: The SSHVR instruction
shifts the signed 32-bit value in src2 to the left or right
by the number of bits specified by src1 and places the
result in dst.

The src1 argument is treated as a 2s-complement shift
value. Shift amounts that are larger than 31 are clamped to
a shift amount of 31. Shift amounts that are smaller than
-31 are clamped to a shift amount of -31. If src1 is
positive, src2 is shifted to the right by the value
specified with the sign-extended shifted value being placed
in dst. It should also be noted that when src1 is positive,
the bits shifted to the right past bit 0 are lost. If src1
is negative, src2 is shifted to the left by the absolute
value of the shift amount value and the result is placed in
dst.
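
The two bi-directional shifts, SSHVL and SSHVR, can be sketched with one shared helper. This is a hedged model only: all function names are hypothetical, the sat flag stands in for the SAT bit, and saturating left-shift overflow to the signed 32-bit limits is an assumption inferred from the SSHVL note about saturation. The text does not say whether SSHVR saturates, so applying the same rule there is also an assumption.

```c
#include <stdint.h>

/* Clamp the shift amount to the range -31..31. */
static int32_t clamp_shift(int32_t n)
{
    if (n > 31)  return 31;
    if (n < -31) return -31;
    return n;
}

/* n >= 0 shifts left (saturating, by assumption);
   n < 0 shifts right by |n| with sign extension. */
static int32_t shift_signed(int32_t value, int32_t n, int *sat)
{
    if (n >= 0) {
        int64_t wide = (int64_t)value * ((int64_t)1 << n);
        if (wide > INT32_MAX) { *sat = 1; return INT32_MAX; }
        if (wide < INT32_MIN) { *sat = 1; return INT32_MIN; }
        return (int32_t)wide;
    }
    /* portable sign-filling right shift */
    return (value < 0) ? ~(~value >> -n) : (value >> -n);
}

/* SSHVL: positive src1 shifts left, negative shifts right. */
int32_t sshvl_model(int32_t src2, int32_t src1, int *sat)
{
    return shift_signed(src2, clamp_shift(src1), sat);
}

/* SSHVR: positive src1 shifts right, negative shifts left. */
int32_t sshvr_model(int32_t src2, int32_t src1, int *sat)
{
    return shift_signed(src2, -clamp_shift(src1), sat);
}
```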
STDW

Store Double Word: The STDW instruction stores a 64-bit
quantity to memory from an odd:even register pair, srcd.
Alignment to a 64-bit boundary is required. The effective
memory address is formed from a base address register
(baseR) and an optional offset that is either a register
(offsetR) or a 5-bit unsigned constant (ucst5).

STNDW

Store Non-Aligned Double Word: The STNDW instruction
stores a 64-bit quantity to memory from a 64-bit register
pair, srcd. The STNDW instruction may write a 64-bit value
to any byte boundary. Thus alignment to a 64-bit boundary
is not required. The effective memory address is formed
from a base address register (baseR) and an optional offset
that is either a register (offsetR) or a 5-bit unsigned
constant (ucst5).

STNW

Store Non-Aligned Word: The STNW instruction stores a
32-bit quantity to memory from a 32-bit register, src. The
STNW instruction may write a 32-bit value to any byte
boundary. Thus alignment to a 32-bit boundary is not
required. The effective memory address is formed from a
base address register (baseR) and an optional offset that
is either a register (offsetR) or a 5-bit unsigned constant
(ucst5). If an offset is not given, the assembler assigns
an offset of zero.
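
In portable C, a store to an arbitrary byte boundary is commonly expressed with memcpy, which gives a rough software analogue of STNW and STNDW. In this sketch the function names are hypothetical, memory is modeled as a byte array, the offset is treated as a plain byte count (the text above does not state how offsets are scaled), and the byte order written is that of the host.

```c
#include <stdint.h>
#include <string.h>

/* Analogue of STNW: write 32 bits at any byte address. */
void stnw_model(uint8_t *mem, uint32_t base, uint32_t offset, uint32_t src)
{
    memcpy(mem + base + offset, &src, sizeof src);
}

/* Analogue of STNDW: write 64 bits at any byte address. */
void stndw_model(uint8_t *mem, uint32_t base, uint32_t offset, uint64_t srcd)
{
    memcpy(mem + base + offset, &srcd, sizeof srcd);
}
```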
SUB4

Subtract without Saturation, Signed Packed 8-bit: The
SUB4 performs 2s-complement subtraction between packed 8-
bit quantities. The values in src1 and src2 are treated as
packed 8-bit quantities, and the results are written into
dst in a packed 8-bit format.

For each pair of 8-bit quantities in src1 and src2, the
difference between the 8-bit value from src1 and the 8-bit
value from src2 is calculated to produce an 8-bit result.
The result is placed in the corresponding position in dst.

Specifically, the difference between src1 byte0 and src2
byte0 is placed in byte0 of dst. The difference between
src1 byte1 and src2 byte1 is placed in byte1 of dst. The
difference between src1 byte2 and src2 byte2 is placed in
byte2 of dst. The difference between src1 byte3 and src2
byte3 is placed in byte3 of dst.

No saturation is performed.
SUBABS4

Subtract with Absolute Value, Unsigned Packed 8-bit:
The SUBABS4 instruction calculates the absolute
value of the differences between the packed 8-bit data
contained in the source registers. The values in src1 and
src2 are treated as unsigned packed 8-bit quantities. The
result is written into dst in an unsigned packed 8-bit
format.

For each pair of unsigned 8-bit values in src1 and src2,
the absolute value of the difference is calculated. This
result is then placed in the corresponding position in dst.

Specifically, the absolute value of the difference
between src1 byte0 and src2 byte0 is placed in byte0 of
dst. The absolute value of the difference between src1
byte1 and src2 byte1 is placed in byte1 of dst. The
absolute value of the difference between src1 byte2 and
src2 byte2 is placed in byte2 of dst. And the absolute
value of the difference between src1 byte3 and src2 byte3
is placed in byte3 of dst.
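
The two byte-wise subtractions, SUB4 and SUBABS4, differ only in how each lane result is formed, as the following C sketch shows. The names sub4_model and subabs4_model are hypothetical.

```c
#include <stdint.h>

/* SUB4: per-byte difference, wrapping modulo 256 (no saturation). */
uint32_t sub4_model(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (src1 >> (8 * i)) & 0xFFu;
        uint32_t y = (src2 >> (8 * i)) & 0xFFu;
        dst |= ((x - y) & 0xFFu) << (8 * i);
    }
    return dst;
}

/* SUBABS4: per-byte absolute difference of unsigned values. */
uint32_t subabs4_model(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (src1 >> (8 * i)) & 0xFFu;
        uint32_t y = (src2 >> (8 * i)) & 0xFFu;
        dst |= ((x > y) ? (x - y) : (y - x)) << (8 * i);
    }
    return dst;
}
```

With the bytes 0x20 and 0x30 in the same lane, SUB4 wraps to 0xF0 while SUBABS4 yields 0x10.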
SWAP2 

Swap Half-words in Each Word (Pseudo-Operation): The
SWAP2 is a pseudo-operation that takes the lower half-word
from src2 and places it in the upper half-word of dst while
the upper half-word from src2 is placed in the lower
half-word of dst. It assembles as PACKLH2 src, src, dst.

This instruction is useful for manipulating and
preparing pairs of 16-bit values to be used by the packed
arithmetic operations, such as ADD2.

The SWAP2 instruction can be used in conjunction with 
the SWAP4 instruction to change the byte ordering (and 
therefore, the endianness) of 32-bit data. 
SWAP4

Swap Bytes in Each Half-word: The SWAP4 instruction
exchanges pairs of bytes within each half-word of src2,
placing the result in dst. The values in src2 are treated
as packed 8-bit values.

Specifically, the upper byte in the upper half-word is
placed in the lower byte of the upper half-word while the
lower byte of the upper half-word is placed in the upper
byte of the upper half-word. Also, the upper byte in the
lower half-word is placed in the lower byte of the lower
half-word while the lower byte in the lower half-word is
placed in the upper byte of the lower half-word.

By itself, this instruction changes the ordering of
bytes within half-words. This effectively changes the
endianness of 16-bit data packed in 32-bit words. The
endianness of full 32-bit quantities can be changed by
using the SWAP4 instruction in conjunction with the SWAP2
instruction.
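
The combination of SWAP2 and SWAP4 for a full 32-bit byte reversal can be sketched in C. The names swap2_model, swap4_model and bswap32_model are hypothetical.

```c
#include <stdint.h>

/* SWAP2: exchange the two half-words of a 32-bit word. */
uint32_t swap2_model(uint32_t src2)
{
    return (src2 << 16) | (src2 >> 16);
}

/* SWAP4: exchange the two bytes inside each half-word. */
uint32_t swap4_model(uint32_t src2)
{
    return ((src2 & 0x00FF00FFu) << 8) | ((src2 >> 8) & 0x00FF00FFu);
}

/* Applying both reverses the byte order of the whole word. */
uint32_t bswap32_model(uint32_t x)
{
    return swap4_model(swap2_model(x));
}
```

For the input 0x11223344, SWAP2 gives 0x33441122 and the following SWAP4 gives 0x44332211.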
UNPKHU4

Unpack High Unsigned Packed 8-bit to Unsigned Packed 16-
bit: The UNPKHU4 instruction moves the two most significant
bytes of src2 into the two low bytes of the two half-words
of dst.

Specifically, the upper byte in the upper half-word is
placed in the lower byte of the upper half-word while the
lower byte of the upper half-word is placed in the lower
byte of the lower half-word. The src2 bytes are zero-
extended when unpacked, filling the two high bytes of the
two half-words of dst with zeros.
UNPKLU4

Unpack Low Unsigned Packed 8-bit to Unsigned Packed 16-
bit: The UNPKLU4 instruction moves the two least
significant bytes of src2 into the two low bytes of the two
half-words of dst.

Specifically, the upper byte in the lower half-word is
placed in the lower byte of the upper half-word while the
lower byte of the lower half-word is kept in the lower byte
of the lower half-word. The src2 bytes are zero-extended
when unpacked, filling the two high bytes of the two half-
words of dst with zeros.
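
Both unpack operations reduce to masks and shifts, as in the following C sketch. The names unpkhu4_model and unpklu4_model are hypothetical.

```c
#include <stdint.h>

/* UNPKHU4: bytes 3 and 2 of src2 become zero-extended half-words. */
uint32_t unpkhu4_model(uint32_t src2)
{
    return ((src2 >> 8) & 0x00FF0000u) | ((src2 >> 16) & 0x000000FFu);
}

/* UNPKLU4: bytes 1 and 0 of src2 become zero-extended half-words. */
uint32_t unpklu4_model(uint32_t src2)
{
    return ((src2 << 8) & 0x00FF0000u) | (src2 & 0x000000FFu);
}
```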
XPND2 

Expand Bits to Packed 16-bit Masks: The XPND2
instruction reads the two least-significant bits of src2
and expands them into two half-word masks written to dst.
Bit 1 of src2 is replicated and placed in the upper half-
word of dst. Bit 0 of src2 is replicated and placed in the
lower half-word of dst. Bits 2 through 31 of src2 are
ignored.

This instruction is useful when combined with the output
of CMPGT2 or CMPEQ2, for generating a mask that corresponds
to the individual half-word positions that were compared.
That mask may then be used with ANDN, AND or OR
instructions to perform other operations such as
compositing.
XPND4

Expand Bits to Packed 8-bit Masks: The XPND4 instruction
reads the four least-significant bits of src2 and expands
them into four byte masks written to dst. Bit 0 of src2 is
replicated and placed in the least significant byte of dst.
Bit 1 of src2 is replicated and placed in the second least
significant byte of dst. Bit 2 of src2 is replicated and
placed in the second most significant byte of dst. Bit 3 of
src2 is replicated and placed in the most significant byte of
dst. Bits 4 through 31 of src2 are ignored.

This instruction is useful when combined with the output
of CMPGT4 or CMPEQ4, for generating a mask that corresponds
to the individual byte positions that were compared. That
mask may then be used with ANDN, AND or OR instructions to
perform other operations such as compositing.
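
The mask expansion and a simple compositing step can be sketched in C. The names xpnd2_model, xpnd4_model and select_lanes are hypothetical; select_lanes merely illustrates the AND, ANDN and OR combination mentioned above and is not an instruction of the described processor.

```c
#include <stdint.h>

/* XPND4: expand bits 0..3 of src2 into four byte masks. */
uint32_t xpnd4_model(uint32_t src2)
{
    uint32_t dst = 0;
    for (int i = 0; i < 4; i++)
        if ((src2 >> i) & 1u)
            dst |= 0xFFu << (8 * i);
    return dst;
}

/* XPND2: expand bits 0..1 of src2 into two half-word masks. */
uint32_t xpnd2_model(uint32_t src2)
{
    return ((src2 & 2u) ? 0xFFFF0000u : 0u) |
           ((src2 & 1u) ? 0x0000FFFFu : 0u);
}

/* Compositing: take lanes of a where mask is set, lanes of b elsewhere. */
uint32_t select_lanes(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}
```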



[0137] As an example of the power of the improved instruction set architecture of the present embodiment, a dot-
product example will now be described that illustrates a unification of the following concepts:

- VLIW
- Software Pipelining
- High level Vector Optimizations to transform the code
- Non-aligned memory accesses
- Split-ALU operations

[0138] Table 26 is a listing of a dot-product routine written in the known programming language C. Two arrays
are involved, pointed to by two pointers, bvPtr and vecPtr. Each element pair is multiplied together and added to a
running total.
Table 26 - Dot Product, Natural Version of C-code

int dot_vselp_cn (const short * RESTRICT bvPtr, const short * RESTRICT vecPtr)
{
    int i, psi = 0;

    WORD_ALIGNED(bvPtr);
    WORD_ALIGNED(vecPtr);

    for (i = 0, psi = 0; i < 40; i++)
        psi += bvPtr[i] * vecPtr[i];

    return psi;
}



[0139] Notice that a "WORD_ALIGNED" directive is included to "tell" an optimizing compiler that the pointers bvPtr
and vecPtr are aligned in order to enable the split-ALU operations of the present invention. In order to take advantage
of the split ALU operations of the present embodiment, the optimizer will unroll the loop and load two 16-bit values into
a 32-bit register, as shown in Table 27.
Table 27 - Creating two pointers for packed data

int dot_vselp_c (const int * RESTRICT bvPtr, const int * RESTRICT vecPtr)
{
    int i, psi1 = 0, psi2 = 0;

    WORD_ALIGNED(bvPtr);
    WORD_ALIGNED(vecPtr);


[0140] Notice in Table 27 that bvPtr and vecPtr are now int*, so it can be assumed that they point to 16-bit arrays
that are aligned/packed on 32-bit boundaries. Without this knowledge, which was explicitly provided in the code of Table
26, the vector transformation optimization shown in Table 28 cannot be done to allow two multiply-accumulate (mac)
operations to be performed during each instruction execution phase.
Table 28 - Vector transformation

    for (i = 0; i < 20; i++)
    {
        psi1 += mpy (bvPtr[i], vecPtr[i]);
        psi2 += mpyh (bvPtr[i], vecPtr[i]);
    }

    return psi1 + psi2;
}

[0141] With a non-aligned load double word (lddw) instruction of the present embodiment, alignment is not an issue,
as shown in Table 29.

Table 29 - Dot Product, improved version with non-aligned capability

int dot_vselp_c (const short * RESTRICT bvPtr, const short * RESTRICT vecPtr)
{
    int i, psi1 = 0, psi2 = 0;
    for (i = 0; i < 20; i++)
    {

[0142] The non-aligned load/store operations on the present embodiment use both D-units, so even though 64 bits
are being read at a time, it still takes two cycles to perform this loop. In this case, four mac operations are performed in
two cycles for a net throughput of two mac/cycle, as shown in Table 30. Advantageously, the compiler/optimizer does
not have to know anything about alignment.
Table 30 - Optimized Dot Product Code

        int64 t1 = _mem64(bvPtr[i]);    /* read nonaligned 64bits */
        int64 t2 = _mem64(vecPtr[i]);   /* read nonaligned 64bits */

        int64 psi1 += _mpy2(hi(t1), hi(t2));
        int64 psi2 += _mpy2(lo(t1), lo(t2));
    }

    return _hi(psi1) + _lo(psi1) + _hi(psi2) + _lo(psi2);
}

[0143] Advantageously, the split ALU instructions combined with non-aligned load/store instructions of the present
embodiment enable more loops to have the two mac/cycle throughput. Furthermore, if the optimizer can determine that
bvPtr and vecPtr are aligned, then the code can be optimized further to provide four mac/cycle throughput. Advantageously,
the instruction set architecture of the present embodiment simplifies the programming model by enabling
higher performance of natural C code by a combination of the concepts listed above.
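
The equivalence between the natural loop and the packed, two-accumulator form can be checked with a portable C model. This is only a sketch under stated assumptions: dot_natural, dot_packed and sext16 are hypothetical names, memcpy stands in for a non-aligned 32-bit load, and the two lane products stand in for the mpy and mpyh operations of Table 28. It does not use the intrinsics of Table 30.

```c
#include <stdint.h>
#include <string.h>

/* Sign-extend the low 16 bits of x. */
static int32_t sext16(uint32_t x)
{
    return (int32_t)((x & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Natural version: one multiply-accumulate per element. */
int dot_natural(const int16_t *a, const int16_t *b, int n)
{
    int psi = 0;
    for (int i = 0; i < n; i++)
        psi += a[i] * b[i];
    return psi;
}

/* Packed version: two products per 32-bit word, two accumulators.
   n must be even. The result does not depend on host byte order,
   because both operands are split into lanes the same way. */
int dot_packed(const int16_t *a, const int16_t *b, int n)
{
    int psi1 = 0, psi2 = 0;
    for (int i = 0; i < n; i += 2) {
        uint32_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);  /* alignment-agnostic read */
        memcpy(&wb, b + i, sizeof wb);
        psi1 += sext16(wa) * sext16(wb);              /* low lanes  */
        psi2 += sext16(wa >> 16) * sext16(wb >> 16);  /* high lanes */
    }
    return psi1 + psi2;
}
```

Because the reads go through memcpy, dot_packed gives the same result when an input pointer is only 16-bit aligned, which mirrors the point made above that the compiler need not know anything about alignment.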

Other System Examples 

[0144] Several example systems which can benefit from aspects of the present invention are described in U.S. Patent
5,072,418, which was incorporated by reference herein, particularly with reference to Figures 2-18 of U.S. Patent
5,072,418. A microprocessor incorporating an aspect of the present invention to improve performance or reduce cost
can be used to further improve the systems described in U.S. Patent 5,072,418. Such systems include, but are not
limited to, industrial process controls, automotive vehicle systems, motor controls, robotic control systems, satellite
telecommunication systems, echo canceling systems, modems, video imaging systems, speech recognition systems,
vocoder-modem systems with encryption, and such.

[0145] For example, digital system 1 can be included in a mobile telecommunications device, such as a wireless
telephone with integrated keyboard and display. Digital system 1 with processor 10 is connected to the keyboard, where
appropriate via a keyboard adapter, to the display, where appropriate via a display adapter, and to radio frequency
(RF) circuitry. The RF circuitry is connected to an aerial. Advantageously, the SIMD instruction set along with non-
aligned load/store operations allows more complex algorithms to be performed in a given amount of time so that the
wireless telephone can provide improved communication services.

[0146] Fabrication of digital system 10 involves multiple steps of implanting various amounts of impurities into a
semiconductor substrate and diffusing the impurities to selected depths within the substrate to form transistor devices.
Masks are formed to control the placement of the impurities. Multiple layers of conductive material and insulative
material are deposited and etched to interconnect the various devices. These steps are performed in a clean room
environment.

[0147] A significant portion of the cost of producing the data processing device involves testing. While in wafer form,
individual devices are biased to an operational state and probe tested for basic operational functionality. The wafer is
then separated into individual dice which may be sold as bare die or packaged. After packaging, finished parts are
biased into an operational state and tested for operational functionality.

[0148] Thus, a digital system is provided with a processor having an improved instruction set architecture. The
processor is code-compatible with C62xx DSP processors from Texas Instruments Incorporated. It provides a superset of
the C62x architecture while providing complete code compatibility for existing C62x code. The processor provides
extensions to the existing C62x architecture in several areas: register file enhancements, data path extensions,
additional functional unit hardware, increased orthogonality of the instruction set, data flow enhancements, 8-bit and 16-bit
extensions, and additional instructions that reduce code size and increase register flexibility.

[0149] Advantageously, each .M unit can perform two 16x16 bit multiplies or four 8x8 bit multiplies every clock cycle.
The .D units can access words and double words on any byte boundary by using non-aligned load and store instructions.
The .L unit can perform byte shifts and the .M unit can perform bi-directional variable shifts in addition to the .S unit's
ability to do shifts. The bi-directional shifts directly assist voice-compression codecs (vocoders). Special
communications-specific instructions, such as SHFL, DEAL and GMPY4, have been added to the .M unit to address common
operations in error-correcting codes. Bit-count, Bit-Reverse and Rotate hardware on the .M unit extends support for
bit-level algorithms such as binary morphology, image metric calculations and encryption algorithms.

[0150] Advantageously, in an embodiment of the present invention, increased orthogonality of the Instruction Set
Architecture is provided: the .D unit can perform 32-bit logical instructions in addition to the .S and .L units; the .L and
.D units can be used to load 5-bit constants in addition to the .S unit's ability to load 16-bit constants; the .L unit can
perform byte shifts and the .M unit can perform bi-directional variable shifts in addition to the .S unit's ability to do shifts;
the bi-directional shifts directly assist voice-compression codecs (vocoders).

[0151] Advantageously, an extensive collection of PACK and byte shift instructions simplifies manipulation of packed
data types. Single Instruction, Multiple Data (SIMD) instructions have been added that operate directly on packed data
to streamline data flow and increase instruction set efficiency. The processor has a comprehensive collection of 8-bit
and 16-bit instruction set extensions.

[0152] Advantageously, additional instructions are provided that reduce code size and increase register flexibility.
BDEC and BPOS combine a branch instruction with the decrement/test positive of a destination register, respectively.
These instructions help reduce the number of instructions needed to decrement a loop counter and conditionally branch
based upon the value of that counter. Any register can be used as the loop counter, which can free the standard
conditional registers (A0-A2 and B0-B2) for other uses. The ADDKPC instruction helps reduce the number of instructions
needed to set up the return address for a function call. The BNOP instruction helps reduce the number of instructions
required to perform a branch when NOPs are needed to fill the delay slots of a branch.

[0153] Advantageously, in another embodiment of the present invention, there are no execute packet boundary
restrictions, thereby eliminating a need to pad a fetch packet by adding unneeded NOP instructions.

[0154] In an embodiment of the present invention, there are 64 general-purpose registers. General-purpose registers
A0, A1, A2, B0, B1 and B2 can each be used as a conditional register.

[0155] In another embodiment of the present invention, each .D unit can load and store double words (64 bits). The
.D units can access words and double words on any byte boundary. The .D unit supports data as well as address cross
paths. The same register can be used as a data path cross operand for more than one functional unit in an execute
packet. A delay clock cycle is introduced whenever an instruction attempts to read a register via a cross path that was
updated in the previous cycle. Up to two long sources and two long results can be accessed on each data path every
cycle.

[0156] As used herein, the terms "applied," "connected," and "connection" mean electrically connected, including
where additional elements may be in the electrical connection path. "Associated" means a controlling relationship,
such as a memory resource that is controlled by an associated port. The terms assert, assertion, de-assert,
de-assertion, negate and negation are used to avoid confusion when dealing with a mixture of active high and active low signals.
Assert and assertion are used to indicate that a signal is rendered active, or logically true. De-assert, de-assertion,
negate, and negation are used to indicate that a signal is rendered inactive, or logically false.

[0157] While the invention has been described with reference to illustrative embodiments, this description is not 
intended to be construed in a limiting sense. Various other embodiments of the invention will be apparent to persons 
skilled in the art upon reference to this description. It is therefore contemplated that the appended claims will cover 
any such modifications of the embodiments as fall within the true scope of the invention. 

Claims 

1. A microprocessor having an instruction execution pipeline with a plurality of pipeline phases, comprising:

program fetch circuitry operable to fetch instructions;

instruction decode circuitry connected to receive fetched instructions from the program fetch circuitry, the
instruction decode circuitry operable to decode a set of single instruction, multiple data (SIMD) instructions; and

at least a first functional unit connected to receive control signals from the instruction decode circuitry, wherein
the first functional unit comprises:

multi-field arithmetic/logic unit (ALU) circuitry connected to receive a plurality of source operands and having
outputs connected to provide a multi-field result as a destination operand in response to control signals from
the instruction decode circuitry, wherein the multi-field ALU circuitry is operable to treat the plurality of source
operands as a set of N1 fields, such that the multi-field result includes N1 results corresponding to the set of
N1 fields; and

wherein the first functional unit is operable to provide the multi-field result in response to a single SIMD
instruction of a first type.

2. The microprocessor of claim 1, further comprising a second functional unit connected to operate in parallel with
the first functional unit, wherein the second functional unit comprises multi-field multiplication circuitry connected
to receive a plurality of source operands and having outputs connected to provide a multi-field result as a destination
operand in response to control signals from the instruction decode circuitry, wherein the multi-field multiplication
circuitry is operable to treat the plurality of source operands as a set of N2 fields, such that the multi-field result
includes N2 results corresponding to the set of N2 fields; and

wherein the second functional unit is operable to provide the multi-field result in response to a single SIMD
instruction of a second type.

3. The microprocessor of claim 1 or claim 2, further comprising:

at least a first load/store unit connected to at least a first storage register of the microprocessor;

a memory subsystem having at least a first memory port connected to the first load/store unit;

address generation circuitry in the first load/store unit having a first address output connected to the first
memory port, the address generation circuitry operable to provide a first byte address on the first address
output;

an extraction circuit connected to the first memory port, wherein the extraction circuit is operable to provide a
first non-aligned multi-byte data item to the first load/store unit responsive to the first byte address; and wherein
the load/store unit is operable to provide the multi-byte data item in response to a single instruction of a third
type.

4. The microprocessor of claim 3, wherein the first storage register is part of a register file connected to the first
functional unit and to the second functional unit for providing the plurality of source operands and connected to
the first functional unit and to the second functional unit to receive the destination operands.

5. The microprocessor of claim 3 or claim 4, wherein the instructions of the first type, second type and third type all
have a field for identifying a predicate register.

6. A digital system including a microprocessor as claimed in any one of claims 1 to 5.

7. A digital system as claimed in claim 6, which is a cellular telephone, further comprising:

an integrated keyboard connected to the microprocessor via a keyboard adapter;
a display, connected to the microprocessor via a display adapter;
radio frequency (RF) circuitry connected to the microprocessor; and
an aerial connected to the RF circuitry.

8. A method of operating a microprocessor, comprising the steps of:

providing multi-field arithmetic/logic unit (ALU) circuitry for operating on a set of source operands to form a
multi-field destination operand by treating the plurality of source operands as a set of N1 fields, such that the
multi-field result includes N1 results corresponding to the set of N1 fields; and

providing a set of single instruction, multiple data (SIMD) instructions to direct the operation of the multi-field
ALU circuitry.

9. A method of operating a microprocessor, comprising the steps of:

providing multi-field arithmetic/logic unit (ALU) circuitry for operating on a set of source operands to form a
multi-field destination operand by treating the plurality of source operands as a set of N1 fields, such that the
multi-field result includes N1 results corresponding to the set of N1 fields;

providing multi-field multiplication circuitry for operating on a set of source operands to form a multi-field
destination operand by treating the plurality of source operands as a set of N2 fields, such that the multi-field result
includes N2 results corresponding to the set of N2 fields; and

providing a set of single instruction, multiple data (SIMD) instructions to direct the operation of the multi-field
ALU circuitry and the multi-field multiplication circuitry.

10. The method of claim 8 or claim 9, further comprising the step of providing load/store circuitry operable to transfer
a multi-byte data item to a memory circuit at a byte address that is not word aligned.